Mounting Google Drive¶

In [ ]:
# Access the files stored on Google Drive
print("Mounting Google Drive...")
from google.colab import drive
drive.mount('/content/drive')

!mkdir -p /content/drive/MyDrive/data_processing/  # -p: no error if the directory already exists
download_folder = "/content/drive/MyDrive/CSULB THESIS 2025/Thesis Dataset"
Mounting Google Drive...
Mounted at /content/drive

Import the libraries¶

In [ ]:
# for basic operations
import numpy as np
import pandas as pd
from scipy import stats

# for visualizations
import matplotlib.pyplot as plt
import seaborn as sns
from pylab import rcParams
# plt.style.use('fivethirtyeight')
from mpl_toolkits.mplot3d import Axes3D # Import for 3D plotting
import plotly.express as px

# for modeling
from xgboost.sklearn import XGBClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, Lasso
from sklearn.svm import OneClassSVM
from sklearn.model_selection import (
    GridSearchCV, RandomizedSearchCV, StratifiedKFold, train_test_split,
    cross_val_score, cross_val_predict, cross_validate
)

from sklearn.manifold import TSNE
from sklearn.cluster import KMeans, DBSCAN

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.feature_selection import RFECV

from sklearn.metrics import (
    confusion_matrix, ConfusionMatrixDisplay, f1_score,
    average_precision_score, precision_recall_curve,
    matthews_corrcoef, make_scorer
)

# Autoencoder
import torch
import torch.nn as nn
import torch.optim as optim

from imblearn.over_sampling import SMOTE

# to avoid warnings
import warnings
warnings.filterwarnings('ignore')
warnings.warn("this will not show")  # sanity check: the filter above suppresses this warning
In [ ]:
import torch
import random

# Define your desired random seed
MANUAL_SEED = 42

def set_all_seeds(seed):
    """Sets seed for reproducibility across all relevant libraries."""
    # 1. NumPy seed
    np.random.seed(seed)

    # 2. Python standard random module seed
    random.seed(seed)

    # 3. PyTorch seeds
    torch.manual_seed(seed)

    # 4. PyTorch CUDA/GPU seeds (if available)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)  # for multi-GPU

        # Optional: Disable non-deterministic operations for better reproducibility
        torch.backends.cudnn.deterministic = True
        torch.backends.cudnn.benchmark = False

# Call the function to set the seeds
set_all_seeds(MANUAL_SEED)

print(f"Random seed set to {MANUAL_SEED} for all operations.")
Random seed set to 42 for all operations.

Inventory & understand the data¶

About the Dataset

Context: Manufacturing process feature selection and categorization

Abstract: Data from a semi-conductor manufacturing process

Data Set Characteristics: Multivariate
Number of Instances: 1567
Area: Computer
Attribute Characteristics: Real
Number of Attributes: 591
Date Donated: 2008-11-19
Associated Tasks: Classification, Causal-Discovery
Missing Values: Yes

A complex modern semi-conductor manufacturing process is normally under consistent surveillance via the monitoring of signals/variables collected from sensors and/or process measurement points. However, not all of these signals are equally valuable in a specific monitoring system. The measured signals contain a combination of useful information, irrelevant information, and noise, and the useful information is often buried in the latter two. Engineers typically have a much larger number of signals than are actually required. If we consider each type of signal as a feature, then feature selection may be applied to identify the most relevant signals. The process engineers may then use these signals to determine key factors contributing to yield excursions downstream in the process. This enables increased process throughput, decreased time to learning, and reduced per-unit production costs.

To enhance current business improvement techniques the application of feature selection as an intelligent systems technique is being investigated.

The dataset presented in this case represents a selection of such features, where each example represents a single production entity with its associated measured features, and the label represents a simple pass/fail yield for in-house line testing (figure 2) together with an associated date-time stamp. Here −1 corresponds to a pass, 1 corresponds to a fail, and the date-time stamp is for that specific test point.

Using feature selection techniques, it is desired to rank features according to their impact on the overall yield of the product; causal relationships may also be considered with a view to identifying the key features.

Results may be submitted in terms of feature relevance for predictability, using error rates as the evaluation metric. It is suggested that cross-validation be applied to generate these results. Some baseline results are shown below for basic feature selection techniques using a simple kernel ridge classifier and 10-fold cross-validation.

Baseline Results: Pre-processing objects were applied to the dataset simply to standardize the data and remove the constant features, and then a number of different feature selection objects, each selecting the 40 highest-ranked features, were applied with a simple classifier to achieve some initial results. 10-fold cross-validation was used, and the balanced error rate (BER) was generated as the initial performance metric for investigating this dataset.

SECOM Dataset: 1567 examples, 591 features, 104 fails
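
Since BER is the baseline metric quoted above, here is a minimal sketch of how it can be computed; the helper name is my own, and it assumes labels of -1 for pass and 1 for fail (the positive class).

In [ ]:
# Minimal sketch of the balanced error rate (BER) used in the baseline results.
from sklearn.metrics import confusion_matrix

def balanced_error_rate(y_true, y_pred):
    # BER = 0.5 * (false positive rate + false negative rate)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[-1, 1]).ravel()
    return 0.5 * (fp / (fp + tn) + fn / (fn + tp))

print(balanced_error_rate([-1, -1, 1, 1], [-1, 1, 1, -1]))  # 0.5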

Import dataset

In [ ]:
secom = pd.read_csv("/content/drive/MyDrive/CSULB THESIS 2025/Thesis Dataset/uci-secom.csv")
secom.head()
Out[ ]:
Time 0 1 2 3 4 5 6 7 8 ... 581 582 583 584 585 586 587 588 589 Pass/Fail
0 2008-07-19 11:55:00 3030.93 2564.00 2187.7333 1411.1265 1.3602 100.0 97.6133 0.1242 1.5005 ... NaN 0.5005 0.0118 0.0035 2.3630 NaN NaN NaN NaN -1
1 2008-07-19 12:32:00 3095.78 2465.14 2230.4222 1463.6606 0.8294 100.0 102.3433 0.1247 1.4966 ... 208.2045 0.5019 0.0223 0.0055 4.4447 0.0096 0.0201 0.0060 208.2045 -1
2 2008-07-19 13:17:00 2932.61 2559.94 2186.4111 1698.0172 1.5102 100.0 95.4878 0.1241 1.4436 ... 82.8602 0.4958 0.0157 0.0039 3.1745 0.0584 0.0484 0.0148 82.8602 1
3 2008-07-19 14:43:00 2988.72 2479.90 2199.0333 909.7926 1.3204 100.0 104.2367 0.1217 1.4882 ... 73.8432 0.4990 0.0103 0.0025 2.0544 0.0202 0.0149 0.0044 73.8432 -1
4 2008-07-19 15:22:00 3032.24 2502.87 2233.3667 1326.5200 1.5334 100.0 100.3967 0.1235 1.5031 ... NaN 0.4800 0.4766 0.1045 99.3032 0.0202 0.0149 0.0044 73.8432 -1

5 rows × 592 columns

In [ ]:
secom.describe()
Out[ ]:
0 1 2 3 4 5 6 7 8 9 ... 581 582 583 584 585 586 587 588 589 Pass/Fail
count 1561.000000 1560.000000 1553.000000 1553.000000 1553.000000 1553.0 1553.000000 1558.000000 1565.000000 1565.000000 ... 618.000000 1566.000000 1566.000000 1566.000000 1566.000000 1566.000000 1566.000000 1566.000000 1566.000000 1567.000000
mean 3014.452896 2495.850231 2200.547318 1396.376627 4.197013 100.0 101.112908 0.121822 1.462862 -0.000841 ... 97.934373 0.500096 0.015318 0.003847 3.067826 0.021458 0.016475 0.005283 99.670066 -0.867262
std 73.621787 80.407705 29.513152 441.691640 56.355540 0.0 6.237214 0.008961 0.073897 0.015116 ... 87.520966 0.003404 0.017180 0.003720 3.578033 0.012358 0.008808 0.002867 93.891919 0.498010
min 2743.240000 2158.750000 2060.660000 0.000000 0.681500 100.0 82.131100 0.000000 1.191000 -0.053400 ... 0.000000 0.477800 0.006000 0.001700 1.197500 -0.016900 0.003200 0.001000 0.000000 -1.000000
25% 2966.260000 2452.247500 2181.044400 1081.875800 1.017700 100.0 97.920000 0.121100 1.411200 -0.010800 ... 46.184900 0.497900 0.011600 0.003100 2.306500 0.013425 0.010600 0.003300 44.368600 -1.000000
50% 3011.490000 2499.405000 2201.066700 1285.214400 1.316800 100.0 101.512200 0.122400 1.461600 -0.001300 ... 72.288900 0.500200 0.013800 0.003600 2.757650 0.020500 0.014800 0.004600 71.900500 -1.000000
75% 3056.650000 2538.822500 2218.055500 1591.223500 1.525700 100.0 104.586700 0.123800 1.516900 0.008400 ... 116.539150 0.502375 0.016500 0.004100 3.295175 0.027600 0.020300 0.006400 114.749700 -1.000000
max 3356.350000 2846.440000 2315.266700 3715.041700 1114.536600 100.0 129.252200 0.128600 1.656400 0.074900 ... 737.304800 0.509800 0.476600 0.104500 99.303200 0.102800 0.079900 0.028600 737.304800 1.000000

8 rows × 591 columns

In [ ]:
# 1. Count the values and map the labels for the plot index
counts = secom['Pass/Fail'].value_counts().rename({
    -1: 'Pass (Good)',
    1: 'Fail (Defective)'
})

# 2. Create the plot
plt.figure(figsize=(6, 4))
counts.plot(kind='bar', color=['green', 'red'], rot=0)

# 3. Add titles and show
plt.title('Product Status Distribution')
plt.ylabel('Count')
plt.show()

print("\n--- Numerical Class Counts ---")
print(counts)
[Figure: bar chart of the Product Status Distribution]
--- Numerical Class Counts ---
Pass/Fail
Pass (Good)         1463
Fail (Defective)     104
Name: count, dtype: int64

1. Ingest & clean — types, missing values, outliers, dedupe, normalize categories/units, imbalanced data¶

Check missing values¶

Address missing values

  • Feature removal: variables with more than 30% missing values are removed from the dataset.
  • Row removal: for variables missing in less than 5% of observations, the rows containing those missing values are dropped.
  • Imputation: remaining missing entries are imputed with the mean or the median, depending on the variable's distribution (mean for roughly normal variables, median for skewed ones). A compact sketch combining the first two rules follows this list.
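
A compact sketch of the column-drop and row-drop rules (a reference implementation of my own; the cells below apply the same steps one at a time, and imputation is handled separately).

In [ ]:
# Sketch of the feature-removal and row-removal rules described above.
def drop_missing(df, col_thresh=0.30, row_thresh=0.05):
    pct = df.isna().mean()                               # fraction missing per column
    df = df.drop(columns=pct[pct > col_thresh].index)    # feature removal: >30% missing
    pct = df.isna().mean()
    return df.dropna(subset=pct[pct < row_thresh].index)  # row removal: <5% columns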
In [ ]:
# --- 1. Check missingness per column (counts and %) ---
missing_counts = secom.isna().sum()
missing_percent = missing_counts / len(secom) * 100  # percent missing per column

print("Missing values per column:")
print(missing_counts)
print("\nNumber of columns with any missing values:",
      (missing_counts > 0).sum())
Missing values per column:
Time          0
0             6
1             7
2            14
3            14
             ..
586           1
587           1
588           1
589           1
Pass/Fail     0
Length: 592, dtype: int64

Number of columns with any missing values: 538
In [ ]:
# --- 2. Visualize distribution of % missing per column ---
plt.figure(figsize=(10, 6))
plt.hist(
    missing_percent,
    bins=20,
    edgecolor='black'
)
plt.title('Distribution of % Missing per Column in SECOM Dataset', fontsize=14)
plt.xlabel('Percent Missing in a Column', fontsize=12)
plt.ylabel('Count of Columns', fontsize=12)
plt.grid(axis='y', alpha=0.5)
plt.show()
[Figure: histogram of the percentage of missing values per column]
In [ ]:
# --- 3. Feature removal: drop variables with >30% missing values ---
feature_thresh = 30  # percent
cols_high_missing = missing_percent[missing_percent > feature_thresh].index

print(f"\nNumber of columns dropped (>{feature_thresh}% missing): {len(cols_high_missing)}")
print("Columns dropped:", list(cols_high_missing))

secom = secom.drop(columns=cols_high_missing)
Number of columns dropped (>30% missing): 32
Columns dropped: ['72', '73', '85', '109', '110', '111', '112', '157', '158', '220', '244', '245', '246', '247', '292', '293', '345', '346', '358', '382', '383', '384', '385', '492', '516', '517', '518', '519', '578', '579', '580', '581']
In [ ]:
# --- 4. Row removal: for variables with <5% missing, drop their missing rows ---
# Recompute missingness after dropping high-missing columns (optional, but clean)
missing_counts = secom.isna().sum()
missing_percent = missing_counts / len(secom) * 100

row_thresh = 5  # percent
cols_for_row_drop = missing_percent[missing_percent < row_thresh].index

print(f"\nColumns used for row-wise NA removal (<{row_thresh}% missing): {len(cols_for_row_drop)}")
print("Columns for row drop:", list(cols_for_row_drop))

# Drop rows where these "almost complete" columns have missing values
secom = secom.dropna(subset=cols_for_row_drop)
print("\nFinal shape after cleaning:", secom.shape)
Columns used for row-wise NA removal (<5% missing): 540
Columns for row drop: ['Time', '0', '1', '2', ..., '589', 'Pass/Fail'] (540 columns; full listing omitted here)

Final shape after cleaning: (1393, 560)
In [ ]:
cols_to_plot = ["10", "40"]

# Loop to generate plots for each specified column
for col in cols_to_plot:
    # Drop NaN values, as they cannot be plotted in a histogram or boxplot
    clean_data = secom[col].dropna()

    # Create a figure with two subplots side-by-side (1 row, 2 columns)
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # 1. Histogram (Distribution Plot)
    axes[0].hist(clean_data, bins=30, edgecolor='black', color='teal', alpha=0.7)
    axes[0].set_title(f'Histogram of Feature {col} Values')
    axes[0].set_xlabel(f'Feature {col} Value')
    axes[0].set_ylabel('Frequency (Instance Count)')
    axes[0].grid(axis='y', linestyle='--', alpha=0.7)

    # 2. Boxplot (Distribution Summary)
    axes[1].boxplot(clean_data, patch_artist=True,
                    boxprops=dict(facecolor='lightcoral', color='black'),
                    medianprops=dict(color='black'))
    axes[1].set_title(f'Boxplot for Feature {col}')
    axes[1].set_ylabel(f'Feature {col} Value')
    axes[1].set_xticks([1], [f'Feature {col}']) # Label the single box
    axes[1].grid(axis='y', linestyle='--', alpha=0.7)

    # Adjust layout and save the figure
    plt.tight_layout()
    plt.savefig(f'feature_{col}_hist_boxplot.png')
    plt.show()
    plt.close(fig)
[Figures: histogram and boxplot pairs for features 10 and 40]

Imputing with a summary statistic¶

Check the columns that still have missing values

In [ ]:
cols_with_missing_values = secom.columns[secom.isna().sum()>0]
print(cols_with_missing_values)
secom[cols_with_missing_values]
Index(['546', '547', '548', '549', '550', '551', '552', '553', '554', '555',
       '556', '557', '562', '563', '564', '565', '566', '567', '568', '569'],
      dtype='object')
Out[ ]:
546 547 548 549 550 551 552 553 554 555 556 557 562 563 564 565 566 567 568 569
1 1.3526 408.798 74.640 0.7193 16.00 1.33 0.2829 7.1196 0.4989 53.1836 3.9139 1.7819 NaN NaN NaN NaN NaN NaN NaN NaN
2 0.7942 411.136 74.654 0.1832 16.16 0.85 0.0857 7.1619 0.3752 23.0713 3.9306 1.1386 267.064 0.9032 1.10 0.6219 0.4122 0.2562 0.4119 68.8489
3 1.1650 372.822 72.442 1.8804 131.68 39.33 0.6812 56.9303 17.4781 161.4081 35.3198 54.2917 268.228 0.6511 7.32 0.1630 3.5611 0.0670 2.7290 25.0363
4 1.4636 399.914 79.156 1.0388 19.63 1.98 0.4287 9.7608 0.8311 70.9706 4.9086 2.5014 NaN NaN NaN NaN NaN NaN NaN NaN
5 1.2708 412.222 80.326 1.1050 16.06 1.65 0.5329 7.2448 0.6268 86.9463 3.8960 2.0541 254.006 0.8444 4.75 0.1905 1.8784 0.0743 1.8700 22.5598
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1537 0.8273 400.814 73.254 0.2235 14.53 1.33 0.0949 6.7381 0.5058 27.0176 3.6251 1.8156 270.038 0.7235 9.93 0.1420 3.9802 0.0579 3.6773 19.6342
1539 1.0795 404.148 73.766 1.0847 17.66 1.93 0.4177 8.2522 0.6858 100.4837 4.3697 2.6164 270.276 0.8724 8.79 0.2157 3.9604 0.0892 3.2522 24.7258
1540 1.1302 408.646 74.854 0.9679 10.91 0.76 0.4492 4.9069 0.2863 85.6474 2.6698 1.0153 248.652 0.7768 8.17 0.1185 3.0952 0.0493 3.2857 15.2498
1541 0.8273 400.814 73.254 0.2235 14.53 1.33 0.0949 6.7381 0.5058 27.0176 3.6251 1.8156 264.272 0.5671 4.98 0.0877 2.0902 0.0382 1.8844 15.4662
1550 0.5486 397.310 72.228 0.3441 19.10 1.39 0.1378 8.4989 0.5323 62.7106 4.8073 1.9245 264.272 0.5671 4.98 0.0877 2.0902 0.0382 1.8844 15.4662

1393 rows × 20 columns

Check the distribution of the variables with missing values

  • Normal distribution
  • Skewed distribution
In [ ]:
def check_normality_and_kurtosis(
    df,
    cols=None,
    alpha=0.05,
    shapiro_max=5000,   # Shapiro warning threshold (~5000); SECOM has n=1567 so this won't trigger
    min_n=8,            # minimum n to attempt Shapiro
    kurt_min_n=20,      # kurtosis test is reliable for n >= ~20
    random_state=0
):
    """
    For each numeric column (from `cols` or all df columns):
      - Drop NaNs, then run Shapiro–Wilk for normality (p-value >= alpha => 'normal').
      - Run a kurtosis test for departure from normal kurtosis (p-value < alpha => 'skewed').
    Returns (results_df, normal_vars, skewed_vars) and also prints the two lists.
    """
    cols_iter = list(df.columns) if cols is None else list(cols)
    rng = np.random.default_rng(random_state)
    rows = []

    for col in cols_iter:
        s = df[col]

        # Only numeric columns
        if not pd.api.types.is_numeric_dtype(s):
            continue

        x = s.dropna().to_numpy()
        n = x.size
        miss = s.isna().sum()

        # Skip tiny or constant series
        if n < min_n:
            rows.append(dict(col=col, n=n, missing=miss, note=f"skipped: n<{min_n}"))
            continue
        if np.isclose(np.nanstd(x), 0.0):
            rows.append(dict(col=col, n=n, missing=miss, note="skipped: constant"))
            continue

        # Descriptive skew (for direction only)
        skew_val = stats.skew(x, bias=False)

        # Shapiro–Wilk (cap to shapiro_max if ever needed)
        try:
            x_sh = x if n <= shapiro_max else rng.choice(x, size=shapiro_max, replace=False)
            sh_W, sh_p = stats.shapiro(x_sh)
        except Exception:
            sh_W, sh_p = np.nan, np.nan

        # Kurtosis test (excess kurtosis == 0 under normal)
        kurt_z, kurt_p = np.nan, np.nan
        if n >= kurt_min_n:
            try:
                kurt_z, kurt_p = stats.kurtosistest(x)
            except Exception:
                pass

        rows.append({
            "col": col,
            "n": n,
            "missing": miss,
            "skew": skew_val,             # descriptive (sign gives left/right)
            "shapiro_W": sh_W,
            "shapiro_p": sh_p,
            "kurt_z": kurt_z,
            "kurt_p": kurt_p,
            "note": ""
        })

    res = pd.DataFrame(rows).set_index("col").sort_index()

    # Classification rules
    normal_mask = res["shapiro_p"] >= alpha
    skewed_mask = res["kurt_p"] < alpha

    normal_vars = res.index[normal_mask.fillna(False)].tolist()
    skewed_vars = res.index[skewed_mask.fillna(False)].tolist()

    # ---- Print summaries ----
    print(f"Normally distributed by Shapiro–Wilk (p ≥ {alpha}): {len(normal_vars)}")
    print(normal_vars)

    print(f"\n'Skewed' by kurtosis test (p < {alpha}): {len(skewed_vars)}")
    print(skewed_vars)

    # Optional: show skew direction for the kurtosis-flagged vars
    if len(skewed_vars) > 0:
        dir_tbl = res.loc[skewed_vars, ["skew", "kurt_z", "kurt_p"]].sort_values("skew")
        print("\nSkew direction (negative = left, positive = right) for kurtosis-flagged vars:")
        print(dir_tbl)

    return res, normal_vars, skewed_vars

# ---------- Usage on SECOM: only variables with missing values ----------
cols_with_missing_values = secom.columns[secom.isna().sum() > 0]
results, normal_vars, skewed_vars = check_normality_and_kurtosis(
    secom,
    cols=cols_with_missing_values,
    alpha=0.05
)
Normally distributed by Shapiro–Wilk (p ≥ 0.05): 0
[]

'Skewed' by kurtosis test (p < 0.05): 20
['546', '547', '548', '549', '550', '551', '552', '553', '554', '555', '556', '557', '562', '563', '564', '565', '566', '567', '568', '569']

Skew direction (negative = left, positive = right) for kurtosis-flagged vars:
          skew     kurt_z         kurt_p
col                                     
547  -0.444275   6.407610   1.478182e-10
562   0.133034   8.731383   2.515772e-18
548   0.849374  -8.559926   1.129400e-17
563   0.875335   3.499987   4.652811e-04
546   1.250570   8.887123   6.271149e-19
555   1.715929  12.157176   5.254390e-34
564   1.812195  14.132145   2.407008e-45
568   1.868507  14.660421   1.155540e-48
565   2.110839  13.034970   7.739770e-39
566   2.115757  15.807597   2.757860e-56
569   2.117241  12.724327   4.331909e-37
567   2.226898  13.519533   1.199349e-41
552   3.665908  17.340619   2.322215e-67
549   3.837846  17.678926   6.094651e-70
553  11.068557  23.415281  2.986535e-121
550  12.128992  23.610078  3.036858e-123
556  13.512496  23.874108  5.690932e-126
551  20.480947  24.640177  4.690008e-134
557  20.972181  24.678778  1.807598e-134
554  20.983550  24.692628  1.283435e-134

Conclusion: none of the variables is normally distributed, and all 20 variables with missing values follow a skewed distribution.

In this next step, I fill the missing values of the skewed variables with their median.

In [ ]:
num_cols = [c for c in skewed_vars if pd.api.types.is_numeric_dtype(secom[c])]
medians = secom[num_cols].median()                # Series of medians per column
secom[num_cols] = secom[num_cols].fillna(medians) # column-wise fill
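
A quick sanity check (my addition) that the column drops, row drops, and median fill together removed every missing value:

In [ ]:
# Verify that no missing values remain after the cleaning steps above.
remaining = secom.isna().sum().sum()
print(f"Remaining missing values: {remaining}")
assert remaining == 0, "unexpected NaNs left after imputation"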
In [ ]:
# -1 is Pass, 1 is fail
secom["Pass/Fail"].value_counts()
Out[ ]:
count
Pass/Fail
-1 1294
1 99

In [ ]:
secom.describe()
Out[ ]:
0 1 2 3 4 5 6 7 8 9 ... 577 582 583 584 585 586 587 588 589 Pass/Fail
count 1393.000000 1393.000000 1393.000000 1393.000000 1393.000000 1393.0 1393.000000 1393.000000 1393.000000 1393.000000 ... 1393.000000 1393.000000 1393.000000 1393.000000 1393.000000 1393.000000 1393.000000 1393.000000 1393.000000 1393.000000
mean 3014.966274 2494.861981 2200.348900 1384.417093 2.928322 100.0 101.231608 0.122246 1.465172 -0.000861 ... 16.434664 0.500049 0.015366 0.003858 3.078235 0.021743 0.016454 0.005279 98.088611 -0.857861
std 74.195555 81.555556 29.825001 415.755060 42.128656 0.0 5.900409 0.004998 0.072696 0.015069 ... 12.336507 0.003426 0.018124 0.003930 3.776211 0.012602 0.008801 0.002865 92.838412 0.514067
min 2743.240000 2158.750000 2060.660000 0.000000 0.681500 100.0 82.131100 0.000000 1.191000 -0.053400 ... 4.582000 0.477800 0.006000 0.001700 1.197500 -0.006000 0.003200 0.001000 0.000000 -1.000000
25% 2966.240000 2450.890000 2180.966600 1084.377900 1.016000 100.0 98.131100 0.121100 1.412500 -0.010800 ... 11.280000 0.497900 0.011600 0.003100 2.305500 0.013600 0.010600 0.003300 44.175400 -1.000000
50% 3011.840000 2498.770000 2201.066700 1285.214400 1.307600 100.0 101.666700 0.122400 1.462900 -0.001300 ... 13.733200 0.500100 0.013800 0.003600 2.760200 0.020800 0.014800 0.004600 71.084200 -1.000000
75% 3057.450000 2538.740000 2218.577800 1588.509000 1.518800 100.0 104.586700 0.123800 1.518600 0.008500 ... 17.002800 0.502300 0.016400 0.004100 3.284400 0.027700 0.020300 0.006400 113.550600 -1.000000
max 3356.350000 2846.440000 2315.266700 3715.041700 1114.536600 100.0 129.252200 0.128600 1.656400 0.074900 ... 96.960100 0.509800 0.476600 0.104500 99.303200 0.102800 0.079900 0.028600 737.304800 1.000000

8 rows × 559 columns

Applying clustering methods to the original data¶

Data Preparation and Scaling for Clustering¶

In [ ]:
# Copy the dataset
secom_upd = secom.copy()

# 1. Separate features (X) from the rest
# Drop the target and time column, and all other non-feature columns
feature_cols = [col for col in secom_upd.columns if col not in ["Time", "Pass/Fail"]]
X = secom_upd[feature_cols]

# Ensure all selected features are numeric (they should be after the cleaning steps above)
X = X.select_dtypes(include=np.number)

# 2. Scale the data
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Convert back to DataFrame for easier inspection
X_scaled_df = pd.DataFrame(X_scaled, columns=X.columns)

# Assign a variable for the number of rows/samples
n_samples = X_scaled_df.shape[0]

# Print shape to confirm
print(f"Original feature shape: {X.shape}")
print(f"Scaled feature shape: {X_scaled_df.shape}")

# 3. Count outliers per column: |z| > 3
z_thresh = 3

# Use axis=0 to get counts per column, then wrap in a Series with column names
outlier_counts = pd.Series(
    (np.abs(X_scaled) > z_thresh).sum(axis=0),
    index=X.columns
)

print("\nOutliers per variable (|z| > 3):")
print(outlier_counts.sort_values(ascending=False).head(10))
Original feature shape: (1393, 558)
Scaled feature shape: (1393, 558)

Outliers per variable (|z| > 3):
38     68
576    62
574    60
572    54
577    53
573    52
558    49
575    48
160    48
295    48
dtype: int64
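
Z-scores assume roughly normal features, but many SECOM variables are heavily skewed (see the kurtosis results above). As a cross-check, here is a sketch of an IQR-based outlier count, which is robust to skew:

In [ ]:
# Robust alternative: count outliers per column with the 1.5*IQR fence.
q1, q3 = X.quantile(0.25), X.quantile(0.75)
iqr = q3 - q1
iqr_outliers = ((X < q1 - 1.5 * iqr) | (X > q3 + 1.5 * iqr)).sum()
print("Outliers per variable (1.5*IQR fence):")
print(iqr_outliers.sort_values(ascending=False).head(10))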
In [ ]:
cols_to_plot = ["38", "576", "574", "572", "577", "573"]

# Loop to generate plots for each specified column
for col in cols_to_plot:
    # Drop NaN values, as they cannot be plotted in a histogram or boxplot
    clean_data = secom[col].dropna()

    # Create a figure with two subplots side-by-side (1 row, 2 columns)
    fig, axes = plt.subplots(1, 2, figsize=(12, 5))

    # 1. Histogram (Distribution Plot)
    axes[0].hist(clean_data, bins=30, edgecolor='black', color='teal', alpha=0.7)
    axes[0].set_title(f'Histogram of Feature {col} Values')
    axes[0].set_xlabel(f'Feature {col} Value')
    axes[0].set_ylabel('Frequency (Instance Count)')
    axes[0].grid(axis='y', linestyle='--', alpha=0.7)

    # 2. Boxplot (Distribution Summary)
    axes[1].boxplot(clean_data, patch_artist=True,
                    boxprops=dict(facecolor='lightcoral', color='black'),
                    medianprops=dict(color='black'))
    axes[1].set_title(f'Boxplot for Feature {col}')
    axes[1].set_ylabel(f'Feature {col} Value')
    axes[1].set_xticks([1], [f'Feature {col}']) # Label the single box
    axes[1].grid(axis='y', linestyle='--', alpha=0.7)

    # Adjust layout and save the figure
    plt.tight_layout()
    plt.savefig(f'feature_{col}_hist_boxplot.png')
    plt.show()
    plt.close(fig)
[Figures: histogram and boxplot pairs for features 38, 576, 574, 572, 577, 573]
In [ ]:
X_reset = X.reset_index(drop=True)
X_reset = np.array(X_reset)
In [ ]:
X_reset
Out[ ]:
array([[3.0957800e+03, 2.4651400e+03, 2.2304222e+03, ..., 2.0100000e-02,
        6.0000000e-03, 2.0820450e+02],
       [2.9326100e+03, 2.5599400e+03, 2.1864111e+03, ..., 4.8400000e-02,
        1.4800000e-02, 8.2860200e+01],
       [2.9887200e+03, 2.4799000e+03, 2.1990333e+03, ..., 1.4900000e-02,
        4.4000000e-03, 7.3843200e+01],
       ...,
       [2.9960400e+03, 2.5559200e+03, 2.1907666e+03, ..., 1.2300000e-02,
        4.6000000e-03, 6.7667600e+01],
       [3.2463100e+03, 2.4997900e+03, 2.2168111e+03, ..., 6.3000000e-03,
        1.9000000e-03, 2.3597900e+01],
       [3.0722000e+03, 2.4064700e+03, 2.1954444e+03, ..., 6.5000000e-03,
        2.2000000e-03, 2.7551400e+01]])

SMOTE Logistic Regression¶

In [ ]:
# --- ASSUMED DATA PREPARATION ---
# X_full is the full cleaned feature set
X_full = X_reset.copy()
# y_arr is the binary target array (Pass=-1 -> 0, Fail=1 -> 1)
y_arr = (secom_upd['Pass/Fail'] == 1).astype(int).values

RANDOM_STATE = 42
THRESHOLD = 0.5
STAGE_NAME = "1st Stage: Full Cleaned Features"

print(f"======== {STAGE_NAME} ({X_full.shape[1]} features) ========")

# 1. Split Data (Stratified)
X_train, X_test, y_train, y_test = train_test_split(
    X_full, y_arr, test_size=0.3, random_state=RANDOM_STATE, stratify=y_arr
)

# 2. Preprocessing Pipeline: Impute (Median) -> Scale
preprocessor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

X_train_scaled = preprocessor.fit_transform(X_train)
X_test_scaled = preprocessor.transform(X_test)

# 3. SMOTE on Training Data
smote = SMOTE(sampling_strategy='minority', random_state=RANDOM_STATE)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)

print(f"Train size: {len(y_train)} -> Resampled size: {len(y_train_resampled)}")

# 4. Logistic Regression Model
model = LogisticRegression(solver='liblinear', penalty='l2', random_state=RANDOM_STATE)
model.fit(X_train_resampled, y_train_resampled)

# 5. Prediction and Evaluation (on Test Set)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
y_pred = (y_pred_proba >= THRESHOLD).astype(int)

# Metrics
f1 = f1_score(y_test, y_pred)
auprc = average_precision_score(y_test, y_pred_proba)
cm = confusion_matrix(y_test, y_pred)

print("\n--- Performance Metrics (Test Set) ---")
print(f"F1-score: {f1:.4f}")
print(f"AUPRC:    {auprc:.4f}")

# 6. Plotting: Confusion Matrix & PR Curve
plt.figure(figsize=(12, 5))

# Confusion Matrix
plt.subplot(1, 2, 1)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Pass (0)', 'Fail (1)'])
disp.plot(ax=plt.gca(), cmap='Blues', values_format='d')
plt.title(f"Confusion Matrix\n({STAGE_NAME})")

plt.grid(False)

# Precision-Recall Curve
p, r, _ = precision_recall_curve(y_test, y_pred_proba)
baseline = np.sum(y_test) / len(y_test)

plt.subplot(1, 2, 2)
plt.plot(r, p, label=f'Test Set (AUPRC={auprc:.4f})')
plt.plot([0, 1], [baseline, baseline], linestyle='--', color='red', label=f'Baseline (AP={baseline:.4f})')
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve (Test Set)")

plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
======== 1st Stage: Full Cleaned Features (558 features) ========
Train size: 975 -> Resampled size: 1812

--- Performance Metrics (Test Set) ---
F1-score: 0.2222
AUPRC:    0.1491
[Figure: confusion matrix and precision-recall curve for the test set]
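
One caveat with the single hold-out evaluation above: SMOTE is applied outside of cross-validation, so fold-to-fold variability is invisible. A sketch of a leakage-safe cross-validated estimate, assuming imblearn's Pipeline (which applies samplers only to each fold's training part):

In [ ]:
# Leakage-safe CV sketch: SMOTE runs only on the training portion of each fold.
from imblearn.pipeline import Pipeline as ImbPipeline

cv_pipe = ImbPipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler()),
    ('smote', SMOTE(random_state=RANDOM_STATE)),
    ('clf', LogisticRegression(solver='liblinear', penalty='l2',
                               random_state=RANDOM_STATE)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
scores = cross_val_score(cv_pipe, X_full, y_arr, scoring='f1', cv=cv)
print(f"5-fold F1: {scores.mean():.4f} +/- {scores.std():.4f}")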

K-Means Method¶

Using the elbow method to find the optimal number of clusters¶

In [ ]:
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(X_reset)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
[Figure: elbow plot of WCSS versus number of clusters]

The plot shows an elbow, or "bend", at k = 3 or 4 clusters.
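
Since the elbow is ambiguous, a silhouette check (my addition) can help corroborate the choice of k before committing to a cluster count:

In [ ]:
# Silhouette scores for candidate k values (higher is better).
from sklearn.metrics import silhouette_score

for k in range(2, 7):
    labels = KMeans(n_clusters=k, init='k-means++', random_state=42).fit_predict(X_reset)
    print(f"k={k}: silhouette = {silhouette_score(X_reset, labels):.4f}")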

Training the K-means model on the dataset¶

In [ ]:
kmeans = KMeans(n_clusters = 2, init = 'k-means++', random_state = 42)
y_kmeans = kmeans.fit_predict(X_reset)
In [ ]:
pd.Series(y_kmeans).value_counts()
Out[ ]:
count
1 1101
0 292
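
To see whether the two clusters carry any signal about yield, a quick cross-tabulation (my addition) against the Pass/Fail label:

In [ ]:
# Cross-tab of K-Means cluster assignments vs. the Pass/Fail label.
print(pd.crosstab(y_kmeans, secom_upd['Pass/Fail'].to_numpy(),
                  rownames=['Cluster'], colnames=['Pass/Fail']))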

Visualising the clusters¶

In [ ]:
plt.scatter(X_reset[y_kmeans == 0, 0], X_reset[y_kmeans == 0, 1], s = 15, c = 'lightskyblue', label = 'Cluster 1')
plt.scatter(X_reset[y_kmeans == 1, 0], X_reset[y_kmeans == 1, 1], s = 15, c = 'orange', label = 'Cluster 2')

plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s = 20, c = 'red', label = 'Centroids')
plt.title('K-Means Clusters (plotted on the first two raw features)')
plt.xlabel('Feature 0')
plt.ylabel('Feature 1')
plt.legend()
plt.show()
[Figure: scatter plot of the two K-Means clusters and their centroids]

Combine t-SNE and K-Means¶

Apply a dimensionality-reduction technique such as t-SNE to project the data into a lower dimension before plotting the K-Means clusters.

In [ ]:
# Assuming X_scaled_df is your high-dimensional, scaled feature data
# Assuming y_kmeans contains the cluster labels from your K-Means run

# --- Step 1: Apply 2D t-SNE ---
# n_components=2 for 2D visualization
tsne_2d = TSNE(n_components=2, perplexity=30, learning_rate='auto', random_state=42, n_jobs=-1)
X_tsne_2d = tsne_2d.fit_transform(X_scaled_df)

# Convert to DataFrame for easier plotting
X_tsne_df = pd.DataFrame(data = X_tsne_2d, columns = ['tSNE_1', 'tSNE_2'])
X_tsne_df['KMeans_Cluster'] = y_kmeans # Add cluster labels

# --- Step 2: Plot the 2D t-SNE results ---
plt.figure(figsize=(8, 6))

for cluster in sorted(X_tsne_df['KMeans_Cluster'].unique()):
    subset = X_tsne_df[X_tsne_df['KMeans_Cluster'] == cluster]
    plt.scatter(
        subset['tSNE_1'],
        subset['tSNE_2'],
        s = 10,
        label = f'Cluster {cluster}'
    )

plt.title('K-Means Clusters Visualized by 2D t-SNE')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.legend()
plt.show()
[Figure: K-Means clusters visualized in 2D t-SNE space]
In [ ]:
# Assuming X_scaled_df and y_kmeans are ready

# --- Step 1: Apply 3D t-SNE ---
tsne_3d = TSNE(n_components=3, perplexity=30, learning_rate='auto', random_state=42, n_jobs=-1)
X_tsne_3d = tsne_3d.fit_transform(X_scaled_df)

# Convert to DataFrame
X_tsne_df_3d = pd.DataFrame(data = X_tsne_3d, columns = ['tSNE_1', 'tSNE_2', 'tSNE_3'])
X_tsne_df_3d['KMeans_Cluster'] = y_kmeans

# Define the color for the clusters
cluster_colors = {
    0: 'lightskyblue',
    1: 'orange',
}

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

for cluster_label, color_name in cluster_colors.items():
    subset = X_tsne_df_3d[X_tsne_df_3d['KMeans_Cluster'] == cluster_label]

    ax.scatter(
        subset['tSNE_1'],
        subset['tSNE_2'],
        subset['tSNE_3'],
        s=10,
        c=color_name,  # <<< Use the specific color here
        label=f'Cluster {cluster_label}',
        alpha=0.7
    )

ax.set_title('K-Means Clusters Visualized by 3D t-SNE')
ax.set_xlabel('t-SNE Component 1')
ax.set_ylabel('t-SNE Component 2')
ax.set_zlabel('t-SNE Component 3')
ax.legend()
plt.show()
[Figure: K-Means clusters visualized in 3D t-SNE space]
In [ ]:
# Assuming X_tsne_df_3d is available and contains the data.
# Note: You'll need to run your data preparation steps first
# (TSNE n_components=3 and K-Means) to populate this DataFrame.

# Ensure the cluster column is treated as a string/categorical for distinct coloring
X_tsne_df_3d['KMeans_Cluster_Str'] = X_tsne_df_3d['KMeans_Cluster'].astype(str)

# Define the color of the Kmeans cluster
specific_plotly_colors = {
    '0': 'lightskyblue',
    '1': 'orange',
}

# Create the interactive 3D scatter plot using Plotly Express
fig = px.scatter_3d(
    X_tsne_df_3d,
    x='tSNE_1',
    y='tSNE_2',
    z='tSNE_3',
    color='KMeans_Cluster_Str',         # Color points by cluster
    color_discrete_map=specific_plotly_colors, # <<< Assign specific colors
    title='Interactive 3D K-Means Clusters (Plotly)',
    opacity=0.7,                        # Transparency level
    size_max=5,                         # only takes effect when a `size` column is mapped; fixed size set below
    labels={'tSNE_1': 't-SNE Component 1',
            'tSNE_2': 't-SNE Component 2',
            'tSNE_3': 't-SNE Component 3'}
)

# You can further customize the marker size and styling if needed:
fig.update_traces(marker=dict(size=3)) # <<< Sets a fixed size for all markers

fig.show() # This displays the rotatable plot in Google Colab

t-SNE¶

In [ ]:
# High number of features (590+) requires dimensionality reduction for visualization.
# Note: t-SNE is computationally expensive; reduce n_components to 2 for visualization.

# Run t-SNE on the scaled data
# n_components=2 for 2D visualization
tsne = TSNE(n_components=2, perplexity=30, learning_rate='auto', random_state=42, n_jobs=-1)
X_tsne = tsne.fit_transform(X_scaled_df)

# Store the t-SNE results in the DataFrame
secom_upd['tSNE_1'] = X_tsne[:, 0]
secom_upd['tSNE_2'] = X_tsne[:, 1]

print("t-SNE reduction complete. Results stored in 'tSNE_1' and 'tSNE_2'.")
t-SNE reduction complete. Results stored in 'tSNE_1' and 'tSNE_2'.

DBSCAN and t-SNE¶

In [ ]:
# 1. --- FIT DBSCAN MODEL ---
# NOTE: The parameters (eps, min_samples) must be tuned for your SECOM data.
# A common starting point for min_samples is 2 * number of dimensions or ln(n_samples).
min_samples_start = 200   # heuristic alternatives: 2 * n_features or int(np.log(n_samples))
eps_start = 2             # start with a large value and tune, or use a k-distance plot

dbscan = DBSCAN(eps=eps_start, min_samples=min_samples_start)
secom_clusters = dbscan.fit_predict(X_scaled_df)

print(f"DBSCAN Cluster Labels: {np.unique(secom_clusters)}")
print(f"Cluster Distribution (Tuning Required):\n{pd.Series(secom_clusters).value_counts()}")

# 2. --- PREPARE FOR VISUALIZATION (using t-SNE) ---
# DBSCAN was fit on the high-dimensional data, but we need 2D data for a plot.
# We will use t-SNE for dimensionality reduction for visualization.
tsne_2d = TSNE(n_components=2, perplexity=30, learning_rate='auto', random_state=42, n_jobs=-1)
X_tsne_2d = tsne_2d.fit_transform(X_scaled_df)

# Create a DataFrame for plotting
df_secom = pd.DataFrame(X_tsne_2d, columns=['tSNE_1', 'tSNE_2'])
df_secom['Cluster'] = secom_clusters

# 3. --- PLOT THE DBSCAN CLUSTERS (using t-SNE components) ---
plt.figure(figsize=(10, 8))
plt.scatter(
    df_secom['tSNE_1'],
    df_secom['tSNE_2'],
    c=df_secom['Cluster'],
    cmap='viridis', # 'viridis' often works well for many colors
    s=10,          # Minimized dot size
    alpha=0.7
)
plt.title(f'DBSCAN Clusters on SECOM Data (Visualized with 2D t-SNE)')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.colorbar(label='Cluster Label (-1 is Noise)')
plt.show()
DBSCAN Cluster Labels: [-1]
Cluster Distribution (Tuning Required):
-1    1393
Name: count, dtype: int64
[Figure: DBSCAN cluster labels visualized in 2D t-SNE space]
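
Every point was labeled noise, so eps=2 is clearly too small for this 558-dimensional space. A k-distance plot, as suggested in the comments above, is the usual way to pick eps; a sketch:

In [ ]:
# k-distance plot: sort each point's distance to its k-th nearest neighbor
# and look for the "knee"; a reasonable eps lies near that bend.
from sklearn.neighbors import NearestNeighbors

k = min_samples_start
nn = NearestNeighbors(n_neighbors=k).fit(X_scaled_df)
distances, _ = nn.kneighbors(X_scaled_df)
k_dist = np.sort(distances[:, -1])   # distance to the k-th neighbor, sorted ascending

plt.figure(figsize=(8, 5))
plt.plot(k_dist)
plt.xlabel('Points sorted by k-distance')
plt.ylabel(f'Distance to {k}-th nearest neighbor')
plt.title('k-distance plot for choosing DBSCAN eps')
plt.grid(True, alpha=0.5)
plt.show()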

Autoencoder for the original dataset¶

In [ ]:
# =========================================================================
# 1. DATA PREPARATION (SECOM specific changes)
# =========================================================================


X_secom_scaled_array = scaler.fit_transform(X)
y_secom_target = secom_upd['Pass/Fail'].values

# NOTE: X is the cleaned feature matrix from the scaling step above and
# y_secom_target holds the Pass/Fail labels; the input dimension is read
# directly from the data rather than hard-coded.
INPUT_DIM = X_secom_scaled_array.shape[1]
LATENT_DIM = 10
print(f"Detected Input Dimension from data: {INPUT_DIM}") # Check the output!

# Convert to tensor and create DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
X_tensor = torch.tensor(X_secom_scaled_array, dtype=torch.float32).to(device)
dataset = torch.utils.data.TensorDataset(X_tensor)
loader = torch.utils.data.DataLoader(dataset, batch_size=256, shuffle=True)

# =========================================================================
# 2. MODEL DEFINITION (Scaled for High Dimensionality)
# =========================================================================

class DeepAutoencoderSECOM(nn.Module):
    def __init__(self, input_dim=INPUT_DIM, latent_dim=LATENT_DIM):
        super(DeepAutoencoderSECOM, self).__init__()

        # Hidden layer widths for the high-dimensional input, forming a funnel
        hidden_1 = 512
        hidden_2 = 128
        hidden_3 = 64

        # Encoder: 4 layers reducing down to the latent bottleneck
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_1), # input_dim -> 512
            nn.ReLU(),
            nn.Linear(hidden_1, hidden_2),  # 512 -> 128
            nn.ReLU(),
            nn.Linear(hidden_2, hidden_3),  # 128 -> 64
            nn.ReLU(),
            nn.Linear(hidden_3, latent_dim) # 64 -> latent_dim (10)
        )

        # Decoder: 4 layers reconstructing the input
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_3),  # latent_dim (10) -> 64
            nn.ReLU(),
            nn.Linear(hidden_3, hidden_2),    # 64 -> 128
            nn.ReLU(),
            nn.Linear(hidden_2, hidden_1),    # 128 -> 512
            nn.ReLU(),
            nn.Linear(hidden_1, input_dim)    # 512 -> input_dim
        )

    def forward(self, x):
        z = self.encoder(x)
        x_hat = self.decoder(z)
        return z, x_hat

    def encode(self, x):
        """Method to explicitly get the latent representation Z."""
        return self.encoder(x)


# =========================================================================
# 3. TRAINING & PLOTTING
# =========================================================================

# Initialize model, optimizer, and loss function
model = DeepAutoencoderSECOM(input_dim=INPUT_DIM, latent_dim=LATENT_DIM).to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# Training loop
epochs = 50 # Reduced epochs for faster execution/testing. Increase this for final training.
model.train()
print(f"Starting training on {INPUT_DIM} features with {epochs} epochs...")
for epoch in range(epochs):
    total_loss = 0
    for batch in loader:
        batch_x = batch[0]
        optimizer.zero_grad()
        z, x_hat = model(batch_x)
        loss = criterion(x_hat, batch_x)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    avg_loss = total_loss / len(loader)
    if (epoch + 1) % 5 == 0:
        print(f"Epoch {epoch+1}/{epochs}, Loss={avg_loss:.6f}")

# Extract latent representation from encoder
model.eval()
with torch.no_grad():
    Z, _ = model(X_tensor)
    Z_np = Z.cpu().numpy()

# Plot the first two of the latent dimensions, colored by the Pass/Fail target
plt.figure(figsize=(8, 7))

# Use the original SECOM labels (y_secom_target) for coloring:
# 1 is 'Fail' and -1 is 'Pass'; a diverging cmap keeps the classes distinct
plt.scatter(Z_np[:, 0], Z_np[:, 1], c=y_secom_target,
            cmap='coolwarm', s=10, alpha=0.7)

plt.title(f"SECOM Data - Deep Autoencoder Latent Space ({LATENT_DIM}D)")
plt.xlabel("Latent Component 1")
plt.ylabel("Latent Component 2")
plt.colorbar(label="Pass/Fail Label (-1 or 1)")
plt.grid(True)
plt.show()
Detected Input Dimension from data: 558
Starting training on 558 features with 50 epochs...
Epoch 5/50, Loss=0.703927
Epoch 10/50, Loss=0.647051
Epoch 15/50, Loss=0.613516
Epoch 20/50, Loss=0.564242
Epoch 25/50, Loss=0.527392
Epoch 30/50, Loss=0.496396
Epoch 35/50, Loss=0.463096
Epoch 40/50, Loss=0.444802
Epoch 45/50, Loss=0.423758
Epoch 50/50, Loss=0.404138
[Figure: first two autoencoder latent dimensions colored by Pass/Fail]
In [ ]:
# =========================================================================
# 4. Calculate Final Reconstruction Error on the entire dataset
# =========================================================================

model.eval()
with torch.no_grad():
    # Pass the entire dataset (X_tensor) through the model
    _, X_hat_final = model(X_tensor)

    # Calculate the MSE loss between the final reconstruction and the original data
    final_reconstruction_error = criterion(X_hat_final, X_tensor).item()

print(f"\nFinal Reconstruction MSE Error on Full Dataset: {final_reconstruction_error:.6f}")
Final Reconstruction MSE Error on Full Dataset: 0.398797
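
Beyond the single aggregate MSE, the per-sample reconstruction error can serve as an anomaly score. A sketch comparing its distribution across the two classes (no thresholding applied here; the grouping is my addition):

In [ ]:
# Per-sample reconstruction error, summarized separately for pass and fail.
with torch.no_grad():
    per_sample_err = ((X_hat_final - X_tensor) ** 2).mean(dim=1).cpu().numpy()

err_df = pd.DataFrame({'recon_error': per_sample_err, 'label': y_secom_target})
print(err_df.groupby('label')['recon_error'].describe())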

2. Feature Selection¶

Low-variance features¶

Variables with low variance carry almost no information for prediction, so I decided to remove all of them. An off-the-shelf alternative is sketched below.
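
scikit-learn ships a ready-made transformer for this. A sketch; note that VarianceThreshold cuts on raw variance, whereas the cell below flags columns where at least 90% of values are identical:

In [ ]:
# Off-the-shelf alternative: drop features whose variance is below a cutoff.
from sklearn.feature_selection import VarianceThreshold

num_X = secom.drop(columns=['Time', 'Pass/Fail'])
selector = VarianceThreshold(threshold=0.0)  # 0.0 removes exactly-constant features
kept = selector.fit_transform(num_X)
print(f"{num_X.shape[1] - kept.shape[1]} constant features would be dropped")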

In [ ]:
secom_upd = secom.copy()

# Find near-constant variables, excluding the target column "Pass/Fail"
near_const = []
for c in secom_upd.columns:
    if c == "Pass/Fail":
        continue  # don't consider the label column

    vc = secom_upd[c].value_counts(dropna=False, normalize=True)
    if not vc.empty and vc.iloc[0] >= 0.90:
        near_const.append(c)

print("Number of variables where ≥90% of values are identical (excluding 'Pass/Fail'):", len(near_const))
print("Example of near-constant variables:", near_const[:10])

print(secom_upd[near_const].value_counts())

# Remove low-variance variables, but keep "Pass/Fail"
secom_reduced = secom_upd.drop(columns=near_const)

print("Original shape:", secom_upd.shape)
print("Reduced shape:", secom_reduced.shape)
secom_reduced.describe()
Number of variables where ≥90% of values are identical (excluding 'Pass/Fail'): 126
Example of near-constant variables: ['5', '13', '42', '49', '52', '69', '74', '97', '114', '141']
[Output truncated: the value_counts() table over the 126 near-constant columns is too wide to render. Its top row, the dominant value combination, accounts for 1371 of the 1393 rows; every other combination occurs exactly once.]
                                              0.0316  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.00   0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.3161  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0000  0.0000   0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0999  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1000.0000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0       1
                                              0.0414  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.00   0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.4138  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0000  0.0000   0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.1309  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  1000.0000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0       1
                                 4.1955  0.0  0.0000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  2.0  46.15  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.4472  13.9147  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0000  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  200.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0000     0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0       1
Name: count, dtype: int64
Original shape: (1393, 560)
Reduced shape: (1393, 434)
Out[ ]:
0 1 2 3 4 6 7 8 9 10 ... 577 582 583 584 585 586 587 588 589 Pass/Fail
count 1393.000000 1393.000000 1393.000000 1393.000000 1393.000000 1393.000000 1393.000000 1393.000000 1393.000000 1393.000000 ... 1393.000000 1393.000000 1393.000000 1393.000000 1393.000000 1393.000000 1393.000000 1393.000000 1393.000000 1393.000000
mean 3014.966274 2494.861981 2200.348900 1384.417093 2.928322 101.231608 0.122246 1.465172 -0.000861 0.000021 ... 16.434664 0.500049 0.015366 0.003858 3.078235 0.021743 0.016454 0.005279 98.088611 -0.857861
std 74.195555 81.555556 29.825001 415.755060 42.128656 5.900409 0.004998 0.072696 0.015069 0.009343 ... 12.336507 0.003426 0.018124 0.003930 3.776211 0.012602 0.008801 0.002865 92.838412 0.514067
min 2743.240000 2158.750000 2060.660000 0.000000 0.681500 82.131100 0.000000 1.191000 -0.053400 -0.034900 ... 4.582000 0.477800 0.006000 0.001700 1.197500 -0.006000 0.003200 0.001000 0.000000 -1.000000
25% 2966.240000 2450.890000 2180.966600 1084.377900 1.016000 98.131100 0.121100 1.412500 -0.010800 -0.005700 ... 11.280000 0.497900 0.011600 0.003100 2.305500 0.013600 0.010600 0.003300 44.175400 -1.000000
50% 3011.840000 2498.770000 2201.066700 1285.214400 1.307600 101.666700 0.122400 1.462900 -0.001300 0.000400 ... 13.733200 0.500100 0.013800 0.003600 2.760200 0.020800 0.014800 0.004600 71.084200 -1.000000
75% 3057.450000 2538.740000 2218.577800 1588.509000 1.518800 104.586700 0.123800 1.518600 0.008500 0.005800 ... 17.002800 0.502300 0.016400 0.004100 3.284400 0.027700 0.020300 0.006400 113.550600 -1.000000
max 3356.350000 2846.440000 2315.266700 3715.041700 1114.536600 129.252200 0.128600 1.656400 0.074900 0.053000 ... 96.960100 0.509800 0.476600 0.104500 99.303200 0.102800 0.079900 0.028600 737.304800 1.000000

8 rows × 433 columns

Multicollinearity - Filter by correlation test¶

  • The Pearson correlation coefficient tests for linear associations.

  • Spearman's rank correlation coefficient tests for monotonic (possibly non-linear) associations.

  • Are there any monotonic relationships? In a monotonic relationship the variables tend to move in the same relative direction, but not necessarily at a constant rate; the minimal sketch below illustrates how the two coefficients differ on such a pair.
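In this sketch (synthetic data, not SECOM), an exponential pair is almost perfectly monotonic, so Spearman ρ ≈ 1 while Pearson r is noticeably lower:

In [ ]:
# Synthetic illustration of monotonic-but-non-linear association.
# `stats` is scipy.stats, already imported at the top of the notebook.
rng = np.random.default_rng(42)
x_demo = rng.uniform(0.1, 5.0, 500)
y_demo = np.exp(x_demo) + rng.normal(0, 5, 500)   # monotonic, strongly convex

r, _ = stats.pearsonr(x_demo, y_demo)
rho, _ = stats.spearmanr(x_demo, y_demo)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")  # rho ≈ 1 > r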

In [ ]:
# Subset a dataframe that contains only the x (feature) variables
df = secom_reduced.drop(columns=["Time", "Pass/Fail"], errors="ignore")
df = df.select_dtypes(include=[np.number]).copy()
df = df.fillna(df.median(numeric_only=True))

r_thresh   = 0.95   # linear redundancy cutoff
rho_thresh = 0.95   # monotonic redundancy cutoff
linear_guard = 0.85 # only treat as "nonlinear" if |Pearson| < 0.85

# ---- Stage 1: Pearson (linear) ----
R = df.corr(method="pearson").abs()
upper = R.where(np.triu(np.ones(R.shape), k=1).astype(bool))
drop1 = [c for c in upper.columns if (upper[c] >= r_thresh).any()]
keep1 = [c for c in df.columns if c not in drop1]
df1 = df[keep1]

print(f"[Stage 1] {df.shape[1]} → {df1.shape[1]} after linear prune")

# ---- Stage 2: Spearman (monotonic but not strongly linear) ----
if df1.shape[1] > 1:
    R1   = df1.corr(method="pearson").abs()
    S1   = df1.corr(method="spearman").abs()

    # We only consider pairs that are high-Spearman and low-Pearson
    mask = (S1 >= rho_thresh) & (R1 < linear_guard)
    mask_upper = mask.where(np.triu(np.ones(mask.shape), k=1).astype(bool))

    drop2 = [c for c in mask_upper.columns if mask_upper[c].any()]
else:
    drop2 = []

keep_final = [c for c in keep1 if c not in drop2]
df_pruned = df[keep_final]

print(f"[Stage 2] {df1.shape[1]} → {df_pruned.shape[1]} after monotonic prune")
selected_cols = keep_final
[Stage 1] 432 → 268 after linear prune
[Stage 2] 268 → 263 after monotonic prune
In [ ]:
# --- Easy report: just the names ---

print("\n[Stage 1] Linear prune (|Pearson r| ≥", r_thresh, ")")
print("Dropped:", len(drop1))
print(drop1)  # list of column names dropped in Stage 1

print("\n[Stage 2] Monotonic prune (|Spearman ρ| ≥", rho_thresh, "and |Pearson r| <", linear_guard, ")")
print("Dropped:", len(drop2))
print(drop2)  # list of column names dropped in Stage 2
[Stage 1] Linear prune (|Pearson r| ≥ 0.95 )
Dropped: 164
['27', '36', '96', '104', '105', '106', '127', '140', '148', '152', '165', '174', '252', '271', '272', '274', '275', '277', '279', '280', '281', '282', '283', '285', '286', '287', '288', '289', '291', '294', '295', '296', '297', '298', '299', '300', '301', '302', '303', '304', '305', '306', '307', '308', '309', '310', '311', '312', '317', '318', '319', '320', '321', '323', '324', '332', '333', '334', '335', '336', '338', '339', '340', '341', '343', '344', '349', '350', '351', '352', '353', '354', '355', '357', '359', '360', '361', '362', '363', '365', '366', '376', '386', '388', '389', '390', '391', '392', '393', '405', '406', '407', '408', '409', '410', '411', '415', '416', '417', '420', '421', '424', '425', '426', '427', '428', '429', '435', '436', '437', '440', '441', '442', '443', '444', '445', '446', '447', '448', '452', '453', '454', '455', '457', '459', '467', '469', '470', '475', '477', '479', '490', '491', '493', '494', '495', '497', '520', '522', '523', '524', '525', '526', '527', '539', '540', '541', '545', '552', '553', '554', '556', '557', '561', '566', '567', '568', '574', '575', '576', '577', '584', '585', '588']

[Stage 2] Monotonic prune (|Spearman ρ| ≥ 0.95 and |Pearson r| < 0.85 )
Dropped: 5
['18', '201', '337', '431', '496']

Heatmap for correlation matrix¶

In [ ]:
# ---- 1. Select the specific variables by name ----
var_names = ["36", "34", "174", "172", "275", "4", "140", "585", "448"]

df_sub = df[var_names].copy()

# ---- 2. No scaling step needed: Pearson correlation is scale-invariant ----
df_sub_scaled = df_sub

# ---- 3. Pearson correlation matrix ----
corr_matrix = df_sub_scaled.corr(method="pearson")

# ---- 4. Heatmap ----
plt.figure(figsize=(8, 6))
sns.heatmap(
    corr_matrix,
    annot=True,       # show numbers; set to False if too busy
    fmt=".2f",
    cmap="coolwarm",
    vmin=-1, vmax=1,
    square=True,
    linewidths=0.5
)
plt.title("Pearson Correlation Heatmap – Selected Variables")
plt.tight_layout()
plt.show()
No description has been provided for this image
In [ ]:
cols = ["431", "160", "201", "196", "496", "224", "18", "12", "337"]

# Keep only those that actually exist in df (in case any are missing)
cols_in_df = [c for c in cols if c in df.columns]
print("Columns used for Spearman heatmap:", cols_in_df)

df_sub = df[cols_in_df].copy()

# ---- Spearman correlation matrix ----
spearman_corr = df_sub.corr(method="spearman")

# ---- Heatmap using matplotlib ----
fig, ax = plt.subplots(figsize=(8, 6))

im = ax.imshow(spearman_corr, vmin=-1, vmax=1)

# Add colorbar
cbar = fig.colorbar(im, ax=ax)
cbar.set_label("Spearman correlation", rotation=270, labelpad=15)

# Set tick labels
ax.set_xticks(np.arange(len(cols_in_df)))
ax.set_yticks(np.arange(len(cols_in_df)))
ax.set_xticklabels(cols_in_df, rotation=45, ha="right")
ax.set_yticklabels(cols_in_df)

ax.set_title("Spearman Correlation Heatmap – Selected Variables")

plt.tight_layout()
plt.show()
Columns used for Spearman heatmap: ['431', '160', '201', '196', '496', '224', '18', '12', '337']
No description has been provided for this image

Scatterplots of the variable pairs with the highest correlations from Stage 1 (linear prune); the strongest 12 pairs are plotted below.

In [ ]:
import os

def plot_stage1_linear_pairs(df, R, drop1, keep1, r_thresh=0.95, top_k=12, save_dir=None):
    """
    For each column in drop1, find the kept column in keep1 with the strongest |Pearson r|
    (>= r_thresh) and plot scatter + fitted line. Plots up to top_k strongest pairs.
    """
    pairs = []
    for c in drop1:
        # Prefer a KEPT partner so you visualize the redundancy to what remains
        cand = [k for k in keep1 if k != c and abs(R.loc[c, k]) >= r_thresh]
        if not cand:
            # fallback: any column meeting threshold (rare)
            cand = [k for k in df.columns if k != c and abs(R.loc[c, k]) >= r_thresh]
        if cand:
            k = max(cand, key=lambda col: abs(R.loc[c, col]))
            pairs.append((c, k, float(R.loc[c, k])))

    if not pairs:
        print("No Stage-1 pairs found at this threshold.")
        return

    # strongest first
    pairs.sort(key=lambda t: -abs(t[2]))
    if save_dir:
        os.makedirs(save_dir, exist_ok=True)

    for x_col, y_col, r_val in pairs[:top_k]:
        x = df[x_col].to_numpy()
        y = df[y_col].to_numpy()
        m = np.isfinite(x) & np.isfinite(y)
        x, y = x[m], y[m]

        # simple linear fit for the smoothing line
        a, b = np.polyfit(x, y, 1)
        x_line = np.linspace(x.min(), x.max(), 200)
        y_line = a * x_line + b

        plt.figure()
        plt.scatter(x, y, s=8, alpha=0.6)
        plt.plot(x_line, y_line, linewidth=2)
        plt.xlabel(x_col)
        plt.ylabel(y_col)
        plt.title(f"{x_col} vs {y_col}  |  Pearson r = {r_val:.3f}")
        plt.tight_layout()
        if save_dir:
            fname = f"{x_col}__vs__{y_col}__r_{abs(r_val):.3f}.png"
            plt.savefig(os.path.join(save_dir, fname), dpi=150)
        plt.show()

# Use it:
plot_stage1_linear_pairs(df, R, drop1, keep1, r_thresh=r_thresh, top_k=12, save_dir="stage1_pairs")
[12 figures: scatter plots with fitted lines for the strongest Stage-1 pairs; images not available in this export]

Scatterplots of the variable pairs flagged in Stage 2 (monotonic prune)

In [ ]:
# === Plot strongest Stage-2 (monotonic but not strongly linear) pairs ===
from sklearn.isotonic import IsotonicRegression

pairs = []
for c in drop2:
    # partners that satisfy your Stage-2 rule with column c
    candidates = [k for k in df1.columns
                  if k != c and (S1.loc[c, k] >= rho_thresh) and (R1.loc[c, k] < linear_guard)]
    if not candidates:
        continue
    # keep the partner with the largest |Spearman|
    best = max(candidates, key=lambda k: S1.loc[c, k])
    # signed Spearman for direction; Pearson shown as absolute (matches your rule)
    s_signed = df1[c].corr(df1[best], method="spearman")
    r_abs    = R1.loc[c, best]
    pairs.append((c, best, float(s_signed), float(r_abs)))

# strongest first; plot up to N figures
pairs.sort(key=lambda t: -abs(t[2]))
N = min(12, len(pairs))   # change to see more/less plots

if N == 0:
    print("No Stage-2 monotonic pairs found at current thresholds.")
else:
    for x_col, y_col, s_val, r_abs in pairs[:N]:
        x = df1[x_col].to_numpy()
        y = df1[y_col].to_numpy()
        m = np.isfinite(x) & np.isfinite(y)
        x, y = x[m], y[m]

        # sort by x so the line draws cleanly
        order = np.argsort(x)
        xs, ys = x[order], y[order]

        # monotonic smoother (handles increasing or decreasing)
        ir = IsotonicRegression(increasing=True, out_of_bounds="clip")
        if s_val >= 0:
            y_fit = ir.fit_transform(xs, ys)
        else:
            y_fit = -ir.fit_transform(xs, -ys)

        plt.figure()
        plt.scatter(xs, ys, s=8, alpha=0.6)
        plt.plot(xs, y_fit, linewidth=2)
        plt.xlabel(x_col)
        plt.ylabel(y_col)
        plt.title(f"{x_col} vs {y_col}  |  Spearman={s_val:.2f}, |Pearson|={r_abs:.2f}")
        plt.tight_layout()
        plt.show()
[5 figures: scatter plots with isotonic fits for the Stage-2 monotonic pairs; images not available in this export]

Comment:

The off-diagonal panels show strong, monotonic but non-linear relations (mostly inverse “L”/hyperbolic shapes), which is exactly why these pairs passed the Stage-2 rule (high Spearman, low Pearson).

Distributions are highly skewed with floor/saturation near zero and heteroscedasticity (fan shapes). A few outliers exist but don’t drive the overall trend.

Substantively, these five sensors likely track the same latent process/step; they’re largely redundant for modeling.

SECOM labels from {-1, 1} → {0, 1} with 1 = Fail and 0 = Pass.¶

In [ ]:
# Target column
y_col = "Pass/Fail"

# 1) Pull y aligned to df_pruned's rows
y_raw = secom_reduced.loc[df_pruned.index, y_col]

# (optional) sanity check so you don't silently mislabel
vals = pd.unique(y_raw.dropna())
assert set(vals).issubset({-1, 1}), f"Unexpected labels in {y_col}: {vals}"

# 2) Remap {-1, 1} -> {0, 1}  (0=Pass, 1=Fail)
# You can use either replace(...) or a boolean expression; both shown:
y = y_raw.replace({-1: 0, 1: 1}).astype("int8")
# y = (y_raw == 1).astype("int8")   # equivalent
In [ ]:
print(y)
X = df_pruned
print(X)
1       0
2       1
3       0
4       0
5       0
       ..
1537    0
1539    0
1540    0
1541    0
1550    0
Name: Pass/Fail, Length: 1393, dtype: int8
            0        1          2          3       4         6       7  \
1     3095.78  2465.14  2230.4222  1463.6606  0.8294  102.3433  0.1247   
2     2932.61  2559.94  2186.4111  1698.0172  1.5102   95.4878  0.1241   
3     2988.72  2479.90  2199.0333   909.7926  1.3204  104.2367  0.1217   
4     3032.24  2502.87  2233.3667  1326.5200  1.5334  100.3967  0.1235   
5     2946.25  2432.84  2233.3667  1326.5200  1.5334  100.3967  0.1235   
...       ...      ...        ...        ...     ...       ...     ...   
1537  3006.22  2525.20  2192.7889  1268.5852  1.9935  104.5867  0.1268   
1539  2908.94  2560.99  2187.3444  2882.8558  1.5876   85.4189  0.1235   
1540  2996.04  2555.92  2190.7666  3530.2362  0.8017   83.8767  0.1249   
1541  3246.31  2499.79  2216.8111  1190.4067  2.5148  114.5533  0.1230   
1550  3072.20  2406.47  2195.4444  2914.1792  1.5978   85.1011  0.1235   

           8       9      10  ...      569       570     571     572     573  \
1     1.4966 -0.0005 -0.0148  ...  17.1574  535.0164  2.4335    5.92  0.2653   
2     1.4436  0.0041  0.0013  ...  68.8489  535.0245  2.0293   11.21  0.1882   
3     1.4882 -0.0124 -0.0033  ...  25.0363  530.5682  2.0253    9.33  0.1738   
4     1.5031 -0.0031 -0.0072  ...  17.1574  532.0155  2.0275    8.83  0.2224   
5     1.5287  0.0167  0.0055  ...  22.5598  534.2091  2.3236    8.91  0.3201   
...      ...     ...     ...  ...      ...       ...     ...     ...     ...   
1537  1.4522 -0.0039 -0.0075  ...  19.6342  533.7318  2.0706    8.13  0.5545   
1539  1.4167  0.0041 -0.0056  ...  24.7258  569.9964  1.7961  441.92  1.2113   
1540  1.4158 -0.0029  0.0000  ...  15.2498  525.6127  2.0059   10.25  0.5070   
1541  1.3966 -0.0057 -0.0169  ...  15.4662  526.2355  2.1114    8.15  0.2497   
1550  1.3148 -0.0024  0.0039  ...  15.4662  532.1618  2.1401    7.11  0.4659   

         582     583     586     587       589  
1     0.5019  0.0223  0.0096  0.0201  208.2045  
2     0.4958  0.0157  0.0584  0.0484   82.8602  
3     0.4990  0.0103  0.0202  0.0149   73.8432  
4     0.4800  0.4766  0.0202  0.0149   73.8432  
5     0.4949  0.0189  0.0342  0.0151   44.0077  
...      ...     ...     ...     ...       ...  
1537  0.4942  0.0175  0.0199  0.0097   48.7045  
1539  0.4987  0.0118  0.0237  0.0112   47.3376  
1540  0.5011  0.0163  0.0181  0.0123   67.6676  
1541  0.5021  0.0103  0.0266  0.0063   23.5979  
1550  0.5034  0.0178  0.0236  0.0065   27.5514  

[1393 rows x 263 columns]
In [ ]:
from sklearn.linear_model import LogisticRegression
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    confusion_matrix, f1_score, average_precision_score,
    precision_recall_curve, ConfusionMatrixDisplay
)
from imblearn.over_sampling import SMOTE

# --- ASSUMED DATA PREPARATION ---
# X_df is the cleaned feature set
X_df = df_pruned.copy()
# y_arr is the binary target array (Pass=-1 -> 0, Fail=1 -> 1)
y_arr = (secom_upd['Pass/Fail'] == 1).astype(int).values

RANDOM_STATE = 42
THRESHOLD = 0.5
STAGE_NAME = "2nd Stage: 263 Features"

print(f"======== {STAGE_NAME} ({X_df.shape[1]} features) ========")

# 1. Split Data (Stratified)
X_train, X_test, y_train, y_test = train_test_split(
    X_df, y_arr, test_size=0.3, random_state=RANDOM_STATE, stratify=y_arr
)

# 2. Preprocessing Pipeline: Impute (Median) -> Scale
preprocessor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

X_train_scaled = preprocessor.fit_transform(X_train)
X_test_scaled = preprocessor.transform(X_test)

# 3. SMOTE on Training Data
smote = SMOTE(sampling_strategy='minority', random_state=RANDOM_STATE)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)

print(f"Train size: {len(y_train)} -> Resampled size: {len(y_train_resampled)}")

# 4. Logistic Regression Model
model = LogisticRegression(solver='liblinear', penalty='l2', random_state=RANDOM_STATE)
model.fit(X_train_resampled, y_train_resampled)

# 5. Prediction and Evaluation (on Test Set)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
y_pred = (y_pred_proba >= THRESHOLD).astype(int)

# Metrics
f1 = f1_score(y_test, y_pred)
auprc = average_precision_score(y_test, y_pred_proba)
cm = confusion_matrix(y_test, y_pred)

print("\n--- Performance Metrics (Test Set) ---")
print(f"F1-score: {f1:.4f}")
print(f"AUPRC:    {auprc:.4f}")

# 6. Plotting: Confusion Matrix & PR Curve
plt.figure(figsize=(12, 5))

# Confusion Matrix
plt.subplot(1, 2, 1)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Pass (0)', 'Fail (1)'])
disp.plot(ax=plt.gca(), cmap='Blues', values_format='d')
plt.title(f"Confusion Matrix\n({STAGE_NAME})")

plt.grid(False)

# Precision-Recall Curve
p, r, _ = precision_recall_curve(y_test, y_pred_proba)
baseline = np.sum(y_test) / len(y_test)

plt.subplot(1, 2, 2)
plt.plot(r, p, label=f'Test Set (AUPRC={auprc:.4f})')
plt.plot([0, 1], [baseline, baseline], linestyle='--', color='red', label=f'Baseline (AP={baseline:.4f})')
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve (Test Set)")

plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
======== 2nd Stage: 263 Features (263 features) ========
Train size: 975 -> Resampled size: 1812

--- Performance Metrics (Test Set) ---
F1-score: 0.2222
AUPRC:    0.1537
No description has been provided for this image

Recursive Feature Elimination¶

I performed supervised feature selection using Recursive Feature Elimination with cross‑validation (RFECV) on a cleaned feature set (df_pruned) that already had:

  • missing values handled,

  • near‑constant (low‑variance) variables removed, and

  • strongly collinear variables pruned.

RFE then learned which of the remaining sensors are most predictive of failures and selected a compact subset; nested cross‑validation gave unbiased performance estimates and a stability read‑out (how often each feature is chosen across folds).

After handling missing values, removing near‑constant variables, and pruning highly correlated sensors, we applied recursive feature elimination with cross‑validation (RFECV) using a class‑weighted logistic regression base estimator. All preprocessing (median imputation, z‑scaling) and feature selection were encapsulated in a Pipeline and evaluated under a nested 5‑fold stratified cross‑validation scheme. In the inner loop, RFECV iteratively removed the least important features (10% per step), and selected the number of features that maximized average precision; the outer loop provided unbiased estimates of generalization performance. We additionally computed the selection frequency of each feature across outer folds to quantify stability and defined a ‘stable’ subset as features selected in at least 60% of folds.

In [ ]:
# ============================
# Recursive Feature Elimination (RFECV) on df_pruned
# Place this AFTER you define `df_pruned` and `y`
# ============================

RANDOM_STATE = 42

# --- 0) Inputs from your pipeline ---
X = df_pruned.copy()         # features *after* your low-variance + correlation pruning
feature_names = X.columns
assert len(X) == len(y), "X and y must align"

# --- 1) Base estimator for RFE (fast, linear, class-weighted for imbalance) ---
base_est = LogisticRegression(
    penalty="l2",
    solver="liblinear",
    class_weight="balanced",
    max_iter=5000,
    random_state=RANDOM_STATE
)

# --- 2) Inner CV (chooses how many features RFE keeps) + outer CV (evaluates the entire pipeline on unseen data → unbiased metrics.) ---
inner_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE + 1)

# Reasonable lower bound: keep at least 5% of features or 20, whichever is larger
min_keep = max(20, int(0.05 * X.shape[1]))

rfecv = RFECV(
    estimator=base_est,     # Train the estimator, rank features by importance
    step=0.1,               # remove ~10% of remaining features per step; use an int (e.g., 50) to cap compute
    min_features_to_select=min_keep,
    cv=inner_cv,   # Re‑fit and re‑score via inner CV
    scoring="average_precision",     # AUPRC is robust for your class imbalance
    n_jobs=-1
)

pipe = Pipeline(steps=[
    # Everything happens inside the pipeline so the imputer, scaler, and feature selection are fit only on training folds and then applied to validation/test folds. This prevents leakage.
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("rfe", rfecv),
    # After RFE chooses a subset, a fresh logistic model is trained on that subset in each fold.
    ("clf", LogisticRegression(
        penalty="l2",
        solver="liblinear",
        class_weight="balanced",
        max_iter=5000,
        random_state=RANDOM_STATE
    ))
])

# --- 3) Evaluate with nested CV + return the fitted estimators to study selections ---
scoring = {
    "ap": "average_precision",
    "roc": "roc_auc",
    "f1": "f1",
    "mcc": make_scorer(matthews_corrcoef)
}

# Reports AUPRC, ROC‑AUC, F1, MCC across the 5 outer folds.
cv_res = cross_validate( pipe, X, y, cv=outer_cv,
    scoring=scoring,
    return_estimator=True, n_jobs=-1
)

print(
    f"AUPRC  (mean±sd): {cv_res['test_ap'].mean():.3f} ± {cv_res['test_ap'].std():.3f}\n"
    f"ROC-AUC (mean±sd): {cv_res['test_roc'].mean():.3f} ± {cv_res['test_roc'].std():.3f}\n"
    f"F1      (mean±sd): {cv_res['test_f1'].mean():.3f} ± {cv_res['test_f1'].std():.3f}\n"
    f"MCC     (mean±sd): {cv_res['test_mcc'].mean():.3f} ± {cv_res['test_mcc'].std():.3f}"
)

# --- 4) Inspect which features are repeatedly selected across outer folds ---
# For each feature, compute the fraction of outer folds in which it was selected → selection frequency in [0,1].
support_matrix = np.vstack([est.named_steps["rfe"].support_ for est in cv_res["estimator"]])
freq = support_matrix.mean(axis=0)  # selection frequency in [0,1]

selection_frequency = (
    pd.Series(freq, index=feature_names, name="selection_freq")
      .sort_values(ascending=False)
)

# Thresholding heuristic for a stable final set:
# keep features selected in at least 60% of outer folds (tune if needed)
stable_mask = selection_frequency >= 0.60
selected_cols_rfe = selection_frequency.index[stable_mask].tolist()

print(f"\nRFECV kept (median across folds): "
      f"{int(np.median([est.named_steps['rfe'].n_features_ for est in cv_res['estimator']]))} features")
print(f"Stable features (≥60% of folds): {len(selected_cols_rfe)}")
print(selection_frequency.head(20))           # top-20 by stability
# selection_frequency.to_csv("rfe_feature_stability.csv", index=True)

# --- 5) (Optional) Refit a final model on ALL data using the stable subset ---
# You can swap LogisticRegression with XGBClassifier/RandomForest if preferred.
final_model = LogisticRegression(
    penalty="l2", solver="liblinear", class_weight="balanced",
    max_iter=5000, random_state=RANDOM_STATE
)
final_model.fit(X[selected_cols_rfe], y)

# If you want to carry the selected set forward in your notebook:
# selected_cols = selected_cols_rfe
AUPRC  (mean±sd): 0.161 ± 0.042
ROC-AUC (mean±sd): 0.670 ± 0.088
F1      (mean±sd): 0.204 ± 0.070
MCC     (mean±sd): 0.137 ± 0.088

RFECV kept (median across folds): 55 features
Stable features (≥60% of folds): 64
59     1.0
56     1.0
64     1.0
151    1.0
155    1.0
204    1.0
65     0.8
25     0.8
121    0.8
117    0.8
130    0.8
124    0.8
45     0.8
67     0.8
129    0.8
61     0.8
471    0.8
197    0.8
172    0.8
205    0.8
Name: selection_freq, dtype: float64
Out[ ]:
LogisticRegression(class_weight='balanced', max_iter=5000, random_state=42,
                   solver='liblinear')

Visualize Feature-selection stability¶

In [ ]:
# ---- 1A. "Scree"-like line: selection frequency sorted descending ----
sf = selection_frequency.sort_values(ascending=False)
k_stable = int((sf >= 0.60).sum())

plt.figure(figsize=(8, 4))
plt.plot(range(1, len(sf) + 1), sf.values, marker='o', linewidth=1)
plt.axhline(0.60, linestyle='--')
plt.axvline(k_stable, linestyle='--')
plt.title("RFE selection stability (outer CV)")
plt.xlabel("Feature rank by stability")
plt.ylabel("Selection frequency")
plt.tight_layout()
plt.show()

# ---- 1B. Top‑k bar chart ----
top_k = min(20, len(sf))
top = sf.head(top_k)[::-1]  # reverse for horizontal barh
plt.figure(figsize=(8, max(4, top_k * 0.28)))
plt.barh(top.index.astype(str), top.values)
plt.xlabel("Selection frequency")
plt.title(f"Top {top_k} features by RFE stability")
plt.tight_layout()
plt.show()
No description has been provided for this image
No description has been provided for this image

Comment:

In [ ]:
# cv_res contains test scores for each outer fold
metrics_df = pd.DataFrame({
    "AUPRC":  cv_res["test_ap"],
    "ROC-AUC": cv_res["test_roc"],
    "F1":     cv_res["test_f1"],
    "MCC":    cv_res["test_mcc"],
})

plt.figure(figsize=(7, 4))
plt.boxplot([metrics_df[c].values for c in metrics_df.columns], labels=list(metrics_df.columns))
plt.ylabel("Score")
plt.title("Outer-CV performance with RFE pipeline")
plt.tight_layout()
plt.show()
No description has been provided for this image

What is the next step?¶

  • RFECV favored mid-sized subsets (median 55 features across outer folds), and the selection-frequency analysis revealed a stable core (6 features selected in 100% of folds, with many more at 80%). To avoid reinserting correlated surrogates and to reduce variance, we fixed the final subset at 20 features drawn from the stable pool (ranked by univariate outer-CV AUPRC in the next cell). We also evaluated the full 64-feature stable union (≥60% selection frequency) as a robustness check; results were comparable, so we report 20 features for parsimony and interpretability.
  • I included PCA (retain 90–95% variance) as an unsupervised baseline (a minimal sketch follows this list); manifold methods (t‑SNE/MDS) were used solely for visualization because they lack stable, out-of-sample transforms.
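A minimal sketch of the PCA baseline mentioned above (assumes X, y, and outer_cv from the earlier cells; this cell was not part of the original run):

In [ ]:
# Unsupervised PCA baseline: retain 95% of variance, then class-weighted logistic regression
from sklearn.decomposition import PCA

pca_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale",  StandardScaler()),
    ("pca",    PCA(n_components=0.95, random_state=42)),  # keep 95% explained variance
    ("clf",    LogisticRegression(penalty="l2", solver="liblinear",
                                  class_weight="balanced", max_iter=5000,
                                  random_state=42)),
])
ap = cross_val_score(pca_pipe, X, y, cv=outer_cv, scoring="average_precision", n_jobs=-1)
print(f"PCA-95% baseline AUPRC: {ap.mean():.3f} ± {ap.std():.3f}")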
In [ ]:
# ============================================================
# Select TOP-K features by outer-CV AUPRC, then finalize model
# ============================================================

# --- 0) Ensure CV / scoring objects ---
if "outer_cv" not in globals():
    outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = {
    "ap":  "average_precision",
    "roc": "roc_auc",
    "f1":  "f1",
    "mcc": make_scorer(matthews_corrcoef),
}

# --- 1) Choose the candidate pool to rank (pick one) ---
assert "selection_frequency" in globals() and "selected_cols_rfe" in globals(), "Run the RFE block first."
CANDIDATE_POOL = "rfe_stable"   # options: "rfe_stable", "rfe20", "all"

if CANDIDATE_POOL == "rfe_stable":
    candidates = list(selected_cols_rfe)  # ≥60% stable set (64 features in this run)
elif CANDIDATE_POOL == "rfe20":
    candidates = selection_frequency.sort_values(ascending=False).head(20).index.tolist()
elif CANDIDATE_POOL == "all":
    candidates = list(X.columns)
else:
    raise ValueError("CANDIDATE_POOL must be one of {'rfe_stable','rfe20','all'}")

# --- 2) Set k (how many features to keep) ---
k = 20  # <-- change this to the number of features you want
k = min(k, len(candidates))

# --- 3) Define the univariate evaluation pipeline (Impute->Scale->Logistic) ---
uni_pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale",  StandardScaler()),
    ("clf",    LogisticRegression(
        penalty="l2", solver="liblinear", class_weight="balanced",
        max_iter=5000, random_state=RANDOM_STATE
    ))
])

# --- 4) Score each feature individually by AUPRC using the same outer CV ---
rows = []
for f in candidates:
    Xi = X[[f]].copy()
    ap_scores = cross_val_score(uni_pipe, Xi, y, cv=outer_cv, scoring="average_precision", n_jobs=-1)
    rows.append({"feature": f, "ap_mean": ap_scores.mean(), "ap_sd": ap_scores.std()})

rank_df = pd.DataFrame(rows).sort_values(["ap_mean", "ap_sd"], ascending=[False, True]).reset_index(drop=True)

topk_features = rank_df.head(k)["feature"].tolist()
print(f"\nTop-{k} features by outer-CV AUPRC from pool '{CANDIDATE_POOL}':")
print(rank_df.head(k))

# --- 5) Build the final k-feature model (no PCA), evaluate, and pick threshold by OOF-F1 ---
final_pipe_k = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale",  StandardScaler()),
    ("clf",    LogisticRegression(
        penalty="l2", solver="liblinear", class_weight="balanced",
        max_iter=5000, random_state=RANDOM_STATE
    ))
])

X_k = X[topk_features].copy()

# 5a) Outer-CV performance snapshot (optional, for reporting)
cv_perf = cross_validate(final_pipe_k, X_k, y, cv=outer_cv, scoring=scoring, n_jobs=-1)
print(
    f"\n[Top-{k} model | outer CV]\n"
    f"AUPRC  (mean±sd): {cv_perf['test_ap'].mean():.3f} ± {cv_perf['test_ap'].std():.3f}\n"
    f"ROC-AUC (mean±sd): {cv_perf['test_roc'].mean():.3f} ± {cv_perf['test_roc'].std():.3f}\n"
    f"F1      (mean±sd): {cv_perf['test_f1'].mean():.3f} ± {cv_perf['test_f1'].std():.3f}\n"
    f"MCC     (mean±sd): {cv_perf['test_mcc'].mean():.3f} ± {cv_perf['test_mcc'].std():.3f}"
)

# 5b) Pick a deployable operating threshold by maximizing F1 on OOF probabilities (no leakage)
oof_prob = cross_val_predict(final_pipe_k, X_k, y, cv=outer_cv, method="predict_proba", n_jobs=-1)[:, 1]
p, r, thr = precision_recall_curve(y, oof_prob)
f1 = 2 * p[1:] * r[1:] / (p[1:] + r[1:] + 1e-12)
thr_star = float(thr[np.argmax(f1)])
print(f"Chosen operating threshold (max F1 on OOF): {thr_star:.3f}")

# --- 6) Fit the final model on ALL data (ready for deployment) ---
final_pipe_k.fit(X_k, y)

# --- 7) Prediction helper for new data ---
def predict_with_threshold_topk(pipe, X_new_df, features=topk_features, threshold=thr_star):
    Xn = X_new_df[features].copy()
    prob = pipe.predict_proba(Xn)[:, 1]
    yhat = (prob >= threshold).astype("int8")
    return yhat, prob
Top-20 features by outer-CV AUPRC from pool 'rfe_stable':
   feature   ap_mean     ap_sd
0       59  0.180647  0.037314
1       64  0.171111  0.108788
2       21  0.156464  0.045563
3      348  0.155786  0.066097
4      205  0.149234  0.015680
5       65  0.147031  0.078618
6      124  0.138425  0.068338
7      121  0.138024  0.066722
8      163  0.136071  0.049017
9       25  0.130879  0.034320
10     197  0.129044  0.038366
11     123  0.128341  0.059613
12     100  0.125632  0.055898
13      99  0.123956  0.042941
14     573  0.121167  0.033835
15     164  0.116457  0.032798
16      26  0.115988  0.037675
17     129  0.113052  0.018461
18     138  0.111983  0.031427
19     565  0.106037  0.041527

[Top-20 model | outer CV]
AUPRC  (mean±sd): 0.246 ± 0.048
ROC-AUC (mean±sd): 0.778 ± 0.029
F1      (mean±sd): 0.261 ± 0.023
MCC     (mean±sd): 0.225 ± 0.033
Chosen operating threshold (max F1 on OOF): 0.762
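For completeness, a quick sanity call to the helper defined above (a sketch, not part of the original run; it simply re-applies the fitted pipeline and the OOF-selected threshold to the full feature table):

In [ ]:
# Hedged usage example: apply the fitted top-k model and thr_star to X.
# Replace X with any new wafer table that contains the top-k feature columns.
yhat_all, prob_all = predict_with_threshold_topk(final_pipe_k, X)
print(pd.Series(yhat_all).value_counts())  # predicted Pass (0) / Fail (1) counts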

3. Feature Extraction¶

SMOTE Logistic Regression¶

In [ ]:
# --- ASSUMED DATA PREPARATION ---
# X_df is the RFE-selected top 20 features
X_df = X_k.copy()
# y_arr is the binary target array (Pass=-1 -> 0, Fail=1 -> 1)
y_arr = (secom_upd['Pass/Fail'] == 1).astype(int).values

RANDOM_STATE = 42
THRESHOLD = 0.5
STAGE_NAME = "3rd Stage: Top 20 RFE Features"

print(f"======== {STAGE_NAME} ({X_df.shape[1]} features) ========")

# 1. Split Data (Stratified)
X_train, X_test, y_train, y_test = train_test_split(
    X_df, y_arr, test_size=0.3, random_state=RANDOM_STATE, stratify=y_arr
)

# 2. Preprocessing Pipeline: Impute (Median) -> Scale
preprocessor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

X_train_scaled = preprocessor.fit_transform(X_train)
X_test_scaled = preprocessor.transform(X_test)

# 3. SMOTE on Training Data
smote = SMOTE(sampling_strategy='minority', random_state=RANDOM_STATE)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)

print(f"Train size: {len(y_train)} -> Resampled size: {len(y_train_resampled)}")

# 4. Logistic Regression Model
model = LogisticRegression(solver='liblinear', penalty='l2', random_state=RANDOM_STATE)
model.fit(X_train_resampled, y_train_resampled)

# 5. Prediction and Evaluation (on Test Set)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
y_pred = (y_pred_proba >= THRESHOLD).astype(int)

# Metrics
f1 = f1_score(y_test, y_pred)
auprc = average_precision_score(y_test, y_pred_proba)
cm = confusion_matrix(y_test, y_pred)

print("\n--- Performance Metrics (Test Set) ---")
print(f"F1-score: {f1:.4f}")
print(f"AUPRC:    {auprc:.4f}")

# 6. Plotting: Confusion Matrix & PR Curve
plt.figure(figsize=(12, 5))

# Confusion Matrix
plt.subplot(1, 2, 1)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Pass (0)', 'Fail (1)'])
disp.plot(ax=plt.gca(), cmap='Blues', values_format='d')
plt.title(f"Confusion Matrix\n({STAGE_NAME})")

plt.grid(False)

# Precision-Recall Curve
p, r, _ = precision_recall_curve(y_test, y_pred_proba)
baseline = np.sum(y_test) / len(y_test)

plt.subplot(1, 2, 2)
plt.plot(r, p, label=f'Test Set (AUPRC={auprc:.4f})')
plt.plot([0, 1], [baseline, baseline], linestyle='--', color='red', label=f'Baseline (AP={baseline:.4f})')
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve (Test Set)")

plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
======== 3rd Stage: Top 20 RFE Features (20 features) ========
Train size: 975 -> Resampled size: 1812

--- Performance Metrics (Test Set) ---
F1-score: 0.2326
AUPRC:    0.1628
No description has been provided for this image

Apply clustering methods to the top 20 features¶

In [ ]:
# Convert the top-20 feature DataFrame to a NumPy array for clustering
Xk_reset = X_k.to_numpy()
In [ ]:
Xk_reset
Out[ ]:
array([[ 8.07300e-01,  1.91927e+01, -5.44150e+03, ..., -9.46000e-02,
         5.62000e+01,  1.19900e-01],
       [ 2.38245e+01,  1.61755e+01, -5.44775e+03, ..., -1.89200e-01,
         4.51001e+01,  6.21900e-01],
       [ 2.43791e+01,  1.56209e+01, -5.46825e+03, ...,  2.83800e-01,
         4.78000e+01,  1.63000e-01],
       ...,
       [ 5.63820e+00,  1.43618e+01, -5.68550e+03, ..., -9.93400e-01,
         2.79001e+01,  1.18500e-01],
       [-3.18180e+00,  2.31818e+01, -5.28100e+03, ..., -9.93400e-01,
         5.25999e+01,  8.77000e-02],
       [ 3.06820e+00,  1.69318e+01, -5.72700e+03, ..., -5.67700e-01,
         2.20000e+01,  8.77000e-02]])

KMEANS METHOD¶

In [ ]:
from sklearn.cluster import KMeans
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
    kmeans.fit(Xk_reset)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
No description has been provided for this image
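As a complementary check to the elbow plot (a sketch beyond the original analysis), silhouette scores can be compared across candidate k values; higher is better:

In [ ]:
# Silhouette scores for k = 2..6 on the same top-20 feature array
from sklearn.metrics import silhouette_score

for k_ in range(2, 7):
    labels_ = KMeans(n_clusters=k_, init='k-means++', random_state=42).fit_predict(Xk_reset)
    print(f"k={k_}: silhouette = {silhouette_score(Xk_reset, labels_):.3f}")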
In [ ]:
kmeans_k = KMeans(n_clusters = 2, init = 'k-means++', random_state = 42)
yk_kmeans = kmeans_k.fit_predict(Xk_reset)
In [ ]:
pd.Series(yk_kmeans).value_counts()
Out[ ]:
count
0 1035
1 358
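To see whether the two clusters align with the Pass/Fail label at all, a quick cross-tabulation helps (a sketch; assumes yk_kmeans and the remapped y from earlier cells, aligned by row order):

In [ ]:
# Cross-tabulate K-Means cluster assignments against the true labels
print(pd.crosstab(pd.Series(yk_kmeans, name="cluster"),
                  pd.Series(y.to_numpy(), name="fail(1)/pass(0)")))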

In [ ]:
plt.scatter(Xk_reset[yk_kmeans == 0, 0], Xk_reset[yk_kmeans == 0, 1], s = 15, c = 'lightskyblue', label = 'Cluster 1')
plt.scatter(Xk_reset[yk_kmeans == 1, 0], Xk_reset[yk_kmeans == 1, 1], s = 15, c = 'orange', label = 'Cluster 2')

plt.scatter(kmeans_k.cluster_centers_[:, 0], kmeans_k.cluster_centers_[:, 1], s = 20, c = 'red', label = 'Centroids')
plt.title('K-Means clusters (first two of the top-20 features)')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.legend()
plt.show()
No description has been provided for this image
In [ ]:
scaler = StandardScaler()   # instantiate explicitly so this cell is self-contained
Xk_scaled = scaler.fit_transform(X_k)
Xk_scaled_df = pd.DataFrame(Xk_scaled, columns=X_k.columns)

# --- Step 1: Apply 2D t-SNE ---
# n_components=2 for 2D visualization
tsne_2d = TSNE(n_components=2, perplexity=30, learning_rate='auto', random_state=42, n_jobs=-1)
Xk_tsne_2d = tsne_2d.fit_transform(Xk_scaled_df)

# Convert to DataFrame for easier plotting
Xk_tsne_df = pd.DataFrame(data = Xk_tsne_2d, columns = ['tSNE_1', 'tSNE_2'])
Xk_tsne_df['KMeans_Cluster'] = yk_kmeans # Add cluster labels

# --- Step 2: Plot the 2D t-SNE results ---
plt.figure(figsize=(8, 6))

for cluster in sorted(Xk_tsne_df['KMeans_Cluster'].unique()):
    subset = Xk_tsne_df[Xk_tsne_df['KMeans_Cluster'] == cluster]
    plt.scatter(
        subset['tSNE_1'],
        subset['tSNE_2'],
        s = 10,
        label = f'Cluster {cluster}'
    )

plt.title('K-Means Clusters Visualized by 2D t-SNE')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.legend()
plt.show()
No description has been provided for this image
In [ ]:
# Assuming Xk_scaled_df and yk_kmeans are ready

# --- Step 1: Apply 3D t-SNE ---
tsne_3d = TSNE(n_components=3, perplexity=30, learning_rate='auto', random_state=42, n_jobs=-1)
Xk_tsne_3d = tsne_3d.fit_transform(Xk_scaled_df)

# Convert to DataFrame
Xk_tsne_df_3d = pd.DataFrame(data = Xk_tsne_3d, columns = ['tSNE_1', 'tSNE_2', 'tSNE_3'])
Xk_tsne_df_3d['KMeans_Cluster'] = yk_kmeans

# Define the color for the clusters
cluster_colors = {
    0: 'orange',
    1: 'lightskyblue',
}

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

for cluster_label, color_name in cluster_colors.items():
    subset = Xk_tsne_df_3d[Xk_tsne_df_3d['KMeans_Cluster'] == cluster_label]

    ax.scatter(
        subset['tSNE_1'],
        subset['tSNE_2'],
        subset['tSNE_3'],
        s=10,
        c=color_name,  # <<< Use the specific color here
        label=f'Cluster {cluster_label}',
        alpha=0.7
    )

ax.set_title('K-Means Clusters Visualized by 3D t-SNE')
ax.set_xlabel('t-SNE Component 1')
ax.set_ylabel('t-SNE Component 2')
ax.set_zlabel('t-SNE Component 3')
ax.legend()
plt.show()
No description has been provided for this image
In [ ]:
# Apply 3D t-SNE
tsne_3d = TSNE(n_components=3, perplexity=30, learning_rate='auto', random_state=42, n_jobs=-1)
Xk_tsne_3d = tsne_3d.fit_transform(Xk_scaled_df)

# Convert to DataFrame
Xk_tsne_df_3d = pd.DataFrame(data = Xk_tsne_3d, columns = ['tSNE_1', 'tSNE_2', 'tSNE_3'])
Xk_tsne_df_3d['KMeans_Cluster'] = yk_kmeans

# Define the color for the clusters
cluster_colors = {
    0: 'lightskyblue',
    1: 'orange',
}

# Ensure the cluster column is treated as a string/categorical for distinct coloring
Xk_tsne_df_3d['KMeans_Cluster_Str'] = Xk_tsne_df_3d['KMeans_Cluster'].astype(str)

# Define the color of the Kmeans cluster
specific_plotly_colors = {
    '0': 'lightskyblue',
    '1': 'orange',
}

# Create the interactive 3D scatter plot using Plotly Express
fig = px.scatter_3d(
    Xk_tsne_df_3d,
    x='tSNE_1',
    y='tSNE_2',
    z='tSNE_3',
    color='KMeans_Cluster_Str',         # Color points by cluster
    color_discrete_map=specific_plotly_colors, # <<< Assign specific colors
    title='Interactive 3D K-Means Clusters (Plotly)',
    opacity=0.7,                        # Transparency level
    size_max=5,                         # max marker diameter; only applies when a 'size' column is mapped
    labels={'tSNE_1': 't-SNE Component 1',
            'tSNE_2': 't-SNE Component 2',
            'tSNE_3': 't-SNE Component 3'}
)

# You can further customize the marker size and styling if needed:
fig.update_traces(marker=dict(size=3)) # <<< Sets a fixed size for all markers

fig.show() # This displays the rotatable plot in Google Colab
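If the interactive figure needs to be shared outside Colab, Plotly figures can be exported as standalone HTML (hypothetical filename):

In [ ]:
# Save the rotatable 3D plot to a self-contained HTML file
fig.write_html("kmeans_tsne_3d.html")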

DBSCAN and t-SNE¶

In [ ]:
# Import necessary libraries
from sklearn.cluster import DBSCAN
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

# 1. --- FIT DBSCAN MODEL ---
min_samples_start_k = int(np.log(Xk_scaled_df.shape[0]))
eps_start = 3 # Start with a large value and tune down, or use a k-distance plot.

dbscan = DBSCAN(eps=eps_start, min_samples=min_samples_start_k)
secom_clusters = dbscan.fit_predict(Xk_scaled_df)

print(f"DBSCAN Cluster Labels: {np.unique(secom_clusters)}")
print(f"Cluster Distribution (Tuning Required):\n{pd.Series(secom_clusters).value_counts()}")

# 2. --- PREPARE FOR VISUALIZATION (using t-SNE) ---
# DBSCAN was fit on the high-dimensional data, but we need 2D data for a plot.
# We will use t-SNE for dimensionality reduction for visualization.
tsne_2d = TSNE(n_components=2, perplexity=30, learning_rate='auto', random_state=42, n_jobs=-1)
Xk_tsne_2d = tsne_2d.fit_transform(Xk_scaled_df)

# Create a DataFrame for plotting
df_secom = pd.DataFrame(Xk_tsne_2d, columns=['tSNE_1', 'tSNE_2'])
df_secom['Cluster'] = secom_clusters

# 3. --- PLOT THE DBSCAN CLUSTERS (using t-SNE components) ---
plt.figure(figsize=(10, 8))
plt.scatter(
    df_secom['tSNE_1'],
    df_secom['tSNE_2'],
    c=df_secom['Cluster'],
    cmap='viridis', # 'viridis' often works well for many colors
    s=10,          # Minimized dot size
    alpha=0.7
)
plt.title(f'DBSCAN Clusters on SECOM Data (Visualized with 2D t-SNE)')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.colorbar(label='Cluster Label (-1 is Noise)')
plt.show()
DBSCAN Cluster Labels: [-1  0  1]
Cluster Distribution (Tuning Required):
 0    1287
-1      97
 1       9
Name: count, dtype: int64
No description has been provided for this image
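The eps used above is a starting guess; a common heuristic (sketched below, assuming Xk_scaled_df and min_samples_start_k from the cell above) is the k-distance plot: sort every point's distance to its k-th nearest neighbor and look for the "knee", which suggests a reasonable eps.

In [ ]:
# k-distance plot for choosing DBSCAN eps (sketch)
from sklearn.neighbors import NearestNeighbors

k_nn = min_samples_start_k
nn = NearestNeighbors(n_neighbors=k_nn).fit(Xk_scaled_df)
dists, _ = nn.kneighbors(Xk_scaled_df)   # column 0 is the self-distance (0)
kth = np.sort(dists[:, -1])              # distance to the k-th neighbor, sorted

plt.figure(figsize=(7, 4))
plt.plot(kth)
plt.xlabel("Points sorted by k-distance")
plt.ylabel(f"Distance to {k_nn}th nearest neighbor")
plt.title("k-distance plot (look for the knee to pick eps)")
plt.show()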
In [ ]:
from mpl_toolkits.mplot3d import Axes3D # REQUIRED for 3D projection

# --- 1. FIT DBSCAN MODEL (No change required here) ---
# Assuming Xk_scaled_df is your high-dimensional, scaled feature data
min_samples_start_k = int(np.log(Xk_scaled_df.shape[0]))
eps_start = 3

dbscan = DBSCAN(eps=eps_start, min_samples=min_samples_start_k)
secom_clusters = dbscan.fit_predict(Xk_scaled_df)

print(f"DBSCAN Cluster Labels: {np.unique(secom_clusters)}")
print(f"Cluster Distribution (Tuning Required):\n{pd.Series(secom_clusters).value_counts()}")

# 2. --- PREPARE FOR VISUALIZATION (using 3D t-SNE) ---
# Change n_components from 2 to 3
tsne_3d = TSNE(n_components=3, perplexity=30, learning_rate='auto', random_state=42, n_jobs=-1)
Xk_tsne_3d = tsne_3d.fit_transform(Xk_scaled_df)

# Create a DataFrame for plotting, adding the third component
df_secom_3d = pd.DataFrame(Xk_tsne_3d, columns=['tSNE_1', 'tSNE_2', 'tSNE_3'])
df_secom_3d['Cluster'] = secom_clusters

# 3. --- PLOT THE DBSCAN CLUSTERS (using 3D t-SNE components) ---

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d') # <<< KEY CHANGE: Adds 3D axis projection

# Plotting the clusters
scatter = ax.scatter(
    df_secom_3d['tSNE_1'],
    df_secom_3d['tSNE_2'],
    df_secom_3d['tSNE_3'], # <<< ADDING THE THIRD COMPONENT
    c=df_secom_3d['Cluster'],
    cmap='viridis',
    s=10,
    alpha=0.7
)

ax.set_title(f'DBSCAN Clusters on SECOM Data (Visualized with 3D t-SNE)')
ax.set_xlabel('t-SNE Component 1')
ax.set_ylabel('t-SNE Component 2')
ax.set_zlabel('t-SNE Component 3') # <<< ADDING Z-AXIS LABEL

# To display the color bar correctly for 3D Matplotlib
plt.colorbar(scatter, label='Cluster Label (-1 is Noise)')
plt.show()
DBSCAN Cluster Labels: [-1  0  1]
Cluster Distribution (Tuning Required):
 0    1287
-1      97
 1       9
Name: count, dtype: int64
No description has been provided for this image

4. Addressing Data Imbalance with Autoencoders Combined with Data Sampling¶

In [ ]:
# PyTorch
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import DataLoader, TensorDataset, random_split

# Scikit-learn
from sklearn.metrics import (
    confusion_matrix,
    classification_report,
    average_precision_score,
    precision_recall_curve,
    PrecisionRecallDisplay,
    ConfusionMatrixDisplay
)

# Sampling methods
from imblearn.over_sampling import SMOTE, RandomOverSampler, ADASYN
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTEENN, SMOTETomek

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# System and utilities
import warnings
warnings.filterwarnings('ignore')

Autoencoder¶

In [ ]:
# =========================================================================
# 0. REPRODUCIBILITY: SET SEEDS
# =========================================================================
SEED = 42  # pick any integer you like

random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)

if torch.cuda.is_available():
    torch.cuda.manual_seed(SEED)
    torch.cuda.manual_seed_all(SEED)

# (Optional but recommended for strict reproducibility)
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

# =========================================================================
# 1. DATA PREPARATION (SECOM specific changes)
# =========================================================================

# --- Inputs from earlier cells ---
# X_k: DataFrame of the top-20 RFE-selected features
# secom_upd['Pass/Fail']: original {-1, 1} labels (used for coloring the plot)
Xk_scaled = scaler.fit_transform(X_k)
y_secom_target = secom_upd['Pass/Fail'].values

# Input dimension is read directly from the data (20 features after RFE)
INPUT_DIM = Xk_scaled.shape[1]
LATENT_DIM = 10
print(f"Detected Input Dimension from data: {INPUT_DIM}")

# Convert to tensor and create a seeded DataLoader
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
X_tensor = torch.tensor(Xk_scaled, dtype=torch.float32).to(device)
dataset = torch.utils.data.TensorDataset(X_tensor)

g = torch.Generator()
g.manual_seed(SEED)

loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=256,
    shuffle=True,
    generator=g  # seeded shuffling for reproducibility (PyTorch 1.7+)
)

# =========================================================================
# 2. MODEL DEFINITION
# =========================================================================

class DeepAutoencoderSECOM(nn.Module):
    def __init__(self, input_dim=INPUT_DIM, latent_dim=LATENT_DIM):
        super(DeepAutoencoderSECOM, self).__init__()

        # Funnel widths for the encoder/decoder
        hidden_1 = 512
        hidden_2 = 128
        hidden_3 = 64

        # Encoder: 4 layers reducing down to the latent bottleneck
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, hidden_1), # input_dim -> 512
            nn.ReLU(),
            nn.Linear(hidden_1, hidden_2),  # 512 -> 128
            nn.ReLU(),
            nn.Linear(hidden_2, hidden_3),  # 128 -> 64
            nn.ReLU(),
            nn.Linear(hidden_3, latent_dim) # 64 -> latent_dim bottleneck
        )

        # Decoder: 4 layers reconstructing the input
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden_3),  # latent_dim -> 64
            nn.ReLU(),
            nn.Linear(hidden_3, hidden_2),   # 64 -> 128
            nn.ReLU(),
            nn.Linear(hidden_2, hidden_1),   # 128 -> 512
            nn.ReLU(),
            nn.Linear(hidden_1, input_dim)   # 512 -> input_dim
        )

    def forward(self, x):
        z = self.encoder(x)
        x_hat = self.decoder(z)
        return z, x_hat

    def encode(self, x):
        """Method to explicitly get the latent representation Z."""
        return self.encoder(x)


# =========================================================================
# 3. TRAINING & PLOTTING
# =========================================================================

# Initialize model, optimizer, and loss function
model = DeepAutoencoderSECOM(input_dim=INPUT_DIM, latent_dim=LATENT_DIM).to(device)
optimizer = optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

# Training loop
epochs = 50 # Reduced epochs for faster execution/testing. Increase this for final training.
model.train()
print(f"Starting training on {INPUT_DIM} features with {epochs} epochs...")
for epoch in range(epochs):
    total_loss = 0
    for batch in loader:
        batch_x = batch[0]
        optimizer.zero_grad()
        z, x_hat = model(batch_x)
        loss = criterion(x_hat, batch_x)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    avg_loss = total_loss / len(loader)
    if (epoch + 1) % 5 == 0:
        print(f"Epoch {epoch+1}/{epochs}, Loss={avg_loss:.6f}")

# Extract latent representation from encoder
model.eval()
with torch.no_grad():
    Z, _ = model(X_tensor)
    Z_np = Z.cpu().numpy()

# Plot the first two latent dimensions (of LATENT_DIM), colored by Pass/Fail
plt.figure(figsize=(8, 7))

# Use the original SECOM labels for coloring (1 = Fail, -1 = Pass)
plt.scatter(Z_np[:, 0], Z_np[:, 1], c=y_secom_target,
            cmap='coolwarm', s=10, alpha=0.7)

plt.title(f"SECOM Data - Deep Autoencoder Latent Space ({LATENT_DIM}D)")
plt.xlabel("Latent Component 1")
plt.ylabel("Latent Component 2")
plt.colorbar(label="Pass/Fail Label (-1 or 1)")
plt.grid(True)
plt.show()
Detected Input Dimension from data: 20
Starting training on 20 features with 50 epochs...
Epoch 5/50, Loss=0.569383
Epoch 10/50, Loss=0.360770
Epoch 15/50, Loss=0.228854
Epoch 20/50, Loss=0.153011
Epoch 25/50, Loss=0.121260
Epoch 30/50, Loss=0.100450
Epoch 35/50, Loss=0.091366
Epoch 40/50, Loss=0.077644
Epoch 45/50, Loss=0.073149
Epoch 50/50, Loss=0.060785
No description has been provided for this image
In [ ]:
# =========================================================================
# 4. Calculate Final Reconstruction Error on the entire dataset
# =========================================================================

model.eval()
with torch.no_grad():
    # Pass the entire dataset (X_tensor) through the model
    _, X_hat_final = model(X_tensor)

    # Calculate the MSE loss between the final reconstruction and the original data
    final_reconstruction_error = criterion(X_hat_final, X_tensor).item()

print(f"\nFinal Reconstruction MSE Error on Full Dataset: {final_reconstruction_error:.6f}")
Final Reconstruction MSE Error on Full Dataset: 0.058626
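The aggregate MSE hides per-wafer variation. A common follow-up is to treat each sample's reconstruction error as an anomaly score and compare its distribution for Pass vs. Fail wafers; the sketch below is not part of the original pipeline and assumes y_secom_target holds the -1/1 labels used earlier:

In [ ]:
# Sketch: per-sample reconstruction error as a candidate anomaly score
model.eval()
with torch.no_grad():
    _, X_hat = model(X_tensor)
    per_sample_mse = ((X_hat - X_tensor) ** 2).mean(dim=1).cpu().numpy()

err_df = pd.DataFrame({'recon_error': per_sample_mse,
                       'label': np.asarray(y_secom_target)})
print(err_df.groupby('label')['recon_error'].describe())

# Overlaid histograms: do Fail wafers reconstruct worse?
plt.figure(figsize=(7, 4))
for lbl, name in [(-1, 'Pass'), (1, 'Fail')]:
    plt.hist(err_df.loc[err_df['label'] == lbl, 'recon_error'],
             bins=50, alpha=0.6, density=True, label=name)
plt.xlabel('Per-sample reconstruction MSE')
plt.ylabel('Density')
plt.title('Reconstruction Error by Pass/Fail')
plt.legend()
plt.show()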

Extract the latent variables from the Autoencoder¶

In [ ]:
# Extract the latent representation without tracking gradients
with torch.no_grad():
    Z, _ = model(X_tensor)
# Convert to a NumPy array
Z_np = Z.cpu().numpy()
print(Z_np)
Z_np.shape
[[-1.386474   -0.90422934 -0.30455133 ...  0.4420718   1.2684146
   0.44053203]
 [-2.319976   -0.40148574 -2.726464   ...  5.366594   -0.46131426
   3.8805861 ]
 [ 0.37873235 -2.282571   -2.041797   ... -1.9194677   3.1717741
  -3.6053414 ]
 ...
 [-0.6914102  -2.5044188  -0.21591204 ... -0.3063707   3.9818249
   0.9781686 ]
 [-0.42631987 -2.4065845  -0.10155663 ...  0.7870069   3.729516
   0.15664092]
 [-0.9364726  -2.5117254   0.86905694 ... -1.5053236   3.3884928
   0.8203121 ]]
Out[ ]:
(1393, 10)

SMOTE Logistic Regression¶

In [ ]:
# --- DATA PREPARATION ---
# X_df holds the 10 latent variables from the autoencoder output
X_df = pd.DataFrame(Z_np, index=secom_upd.index)  # wrap the NumPy array Z_np in a DataFrame
# y_arr is the binary target array (Pass = -1 -> 0, Fail = 1 -> 1)
y_arr = (secom_upd['Pass/Fail'] == 1).astype(int).values

RANDOM_STATE = 42
THRESHOLD = 0.5
STAGE_NAME = "4th Stage: Autoencoder Latent Z"

print(f"======== {STAGE_NAME} ({X_df.shape[1]} features) ========")

# 1. Split Data (Stratified)
X_train, X_test, y_train, y_test = train_test_split(
    X_df, y_arr, test_size=0.3, random_state=RANDOM_STATE, stratify=y_arr
)

# 2. Preprocessing Pipeline: Impute (Median) -> Scale
# Scaling is particularly important for Latent Variables
preprocessor = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

X_train_scaled = preprocessor.fit_transform(X_train)
X_test_scaled = preprocessor.transform(X_test)

# 3. SMOTE on Training Data
smote = SMOTE(sampling_strategy='minority', random_state=RANDOM_STATE)
X_train_resampled, y_train_resampled = smote.fit_resample(X_train_scaled, y_train)

print(f"Train size: {len(y_train)} -> Resampled size: {len(y_train_resampled)}")

# 4. Logistic Regression Model
model = LogisticRegression(solver='liblinear', penalty='l2', random_state=RANDOM_STATE)
model.fit(X_train_resampled, y_train_resampled)

# 5. Prediction and Evaluation (on Test Set)
y_pred_proba = model.predict_proba(X_test_scaled)[:, 1]
y_pred = (y_pred_proba >= THRESHOLD).astype(int)

# Metrics
f1 = f1_score(y_test, y_pred)
auprc = average_precision_score(y_test, y_pred_proba)
cm = confusion_matrix(y_test, y_pred)

print("\n--- Performance Metrics (Test Set) ---")
print(f"F1-score: {f1:.4f}")
print(f"AUPRC:    {auprc:.4f}")

# 6. Plotting: Confusion Matrix & PR Curve
plt.figure(figsize=(12, 5))

# Confusion Matrix
plt.subplot(1, 2, 1)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['Pass (0)', 'Fail (1)'])
disp.plot(ax=plt.gca(), cmap='Blues', values_format='d')
plt.title(f"Confusion Matrix\n({STAGE_NAME})")

plt.grid(False)

# Precision-Recall Curve
p, r, _ = precision_recall_curve(y_test, y_pred_proba)
baseline = np.sum(y_test) / len(y_test)

plt.subplot(1, 2, 2)
plt.plot(r, p, label=f'Test Set (AUPRC={auprc:.4f})')
plt.plot([0, 1], [baseline, baseline], linestyle='--', color='red', label=f'Baseline (AP={baseline:.4f})')
plt.xlabel("Recall")
plt.ylabel("Precision")
plt.title("Precision-Recall Curve (Test Set)")

plt.legend()
plt.grid(True)
plt.tight_layout()
plt.show()
======== 4th Stage: Autoencoder Latent Z (10 features) ========
Train size: 975 -> Resampled size: 1812

--- Performance Metrics (Test Set) ---
F1-score: 0.2238
AUPRC:    0.1192
[Figure: confusion matrix and precision-recall curve for the SMOTE logistic regression]
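A fixed THRESHOLD of 0.5 is rarely optimal under this imbalance. The thresholds returned by precision_recall_curve can be scanned for the F1-maximizing cut-off; the sketch below tunes on the test probabilities for illustration only (a stricter protocol would tune on a separate validation split):

In [ ]:
# Sketch: pick the probability threshold that maximizes F1 on the PR curve
prec, rec, thresholds = precision_recall_curve(y_test, y_pred_proba)
f1_scores = 2 * prec * rec / (prec + rec + 1e-12)  # guard against 0/0
best_idx = np.argmax(f1_scores[:-1])               # last PR point has no threshold
best_threshold = thresholds[best_idx]

print(f"Best threshold: {best_threshold:.4f}")
print(f"F1 at best threshold: {f1_scores[best_idx]:.4f} (vs {f1:.4f} at 0.5)")

y_pred_tuned = (y_pred_proba >= best_threshold).astype(int)
print(confusion_matrix(y_test, y_pred_tuned))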

Apply clustering methods to the 10 latent variables¶

Elbow¶

In [ ]:
from sklearn.cluster import KMeans
wcss = []  # within-cluster sum of squares for k = 1..10
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(Z_np)
    wcss.append(kmeans.inertia_)
plt.plot(range(1, 11), wcss)
plt.title('The Elbow Method')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.show()
[Figure: elbow plot of WCSS vs. number of clusters]
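The elbow on WCSS alone is often ambiguous; silhouette scores give a complementary criterion (higher is better, and k must be at least 2). A minimal sketch with sklearn's silhouette_score on the same latent matrix:

In [ ]:
# Sketch: silhouette scores as a complement to the elbow method
from sklearn.metrics import silhouette_score

for k in range(2, 8):
    km = KMeans(n_clusters=k, init='k-means++', random_state=42)
    labels = km.fit_predict(Z_np)
    print(f"k={k}: silhouette = {silhouette_score(Z_np, labels):.4f}")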

Kmeans¶

In [ ]:
kmeans = KMeans(n_clusters=2, init='k-means++', random_state=19)
y_kmeans = kmeans.fit_predict(Z_np)
pd.Series(y_kmeans).value_counts()
Out[ ]:
1    1356
0      37
Name: count, dtype: int64
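With only 37 points in the small cluster, a natural check is whether that cluster lines up with the Fail label. A quick crosstab sketch, assuming y_secom_target holds the original -1/1 labels:

In [ ]:
# Sketch: do the K-Means clusters align with Pass/Fail?
print(pd.crosstab(pd.Series(y_kmeans, name='cluster'),
                  pd.Series(np.asarray(y_secom_target), name='Pass/Fail')))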

In [ ]:
plt.scatter(Z_np[y_kmeans == 0, 0], Z_np[y_kmeans == 0, 1], s=15, c='lightskyblue', label='Cluster 1')
plt.scatter(Z_np[y_kmeans == 1, 0], Z_np[y_kmeans == 1, 1], s=15, c='orange', label='Cluster 2')

plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=12, c='red', label='Centroids')
plt.title('K-Means Clusters on the First Two Latent Components')
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.legend()
plt.show()
[Figure: K-Means clusters and centroids on the first two latent components]

Kmeans and t-SNE¶

In [ ]:
scaler = StandardScaler()  # standardize the latent variables before t-SNE
Z_scaled = scaler.fit_transform(Z_np)
Z_scaled_df = pd.DataFrame(Z_scaled)

# --- Step 1: Apply 2D t-SNE ---
# n_components=2 for 2D visualization
tsne_2d = TSNE(n_components=2, perplexity=30, learning_rate='auto', random_state=42, n_jobs=-1)
Z_tsne_2d = tsne_2d.fit_transform(Z_scaled_df)

# Convert to DataFrame for easier plotting
Z_tsne_df = pd.DataFrame(data = Z_tsne_2d, columns = ['tSNE_1', 'tSNE_2'])
Z_tsne_df['KMeans_Cluster'] = y_kmeans # Add cluster labels

# --- Step 2: Plot the 2D t-SNE results ---
plt.figure(figsize=(8, 6))

for cluster in sorted(Z_tsne_df['KMeans_Cluster'].unique()):
    subset = Z_tsne_df[Z_tsne_df['KMeans_Cluster'] == cluster]
    plt.scatter(
        subset['tSNE_1'],
        subset['tSNE_2'],
        s = 10,
        label = f'Cluster {cluster}'
    )

plt.title('K-Means Clusters Visualized by 2D t-SNE')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.legend()
plt.show()
[Figure: K-Means clusters visualized by 2D t-SNE]
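For reference, the same 2D t-SNE embedding can be colored by the true Pass/Fail label rather than the K-Means assignment, to see whether fails concentrate anywhere in the map. A sketch reusing Z_tsne_2d:

In [ ]:
# Sketch: color the same t-SNE map by the true Pass/Fail label
plt.figure(figsize=(8, 6))
plt.scatter(Z_tsne_2d[:, 0], Z_tsne_2d[:, 1],
            c=np.asarray(y_secom_target), cmap='coolwarm', s=10, alpha=0.7)
plt.title('2D t-SNE Colored by Pass/Fail')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.colorbar(label='Pass/Fail Label (-1 or 1)')
plt.show()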
In [ ]:
# Z_scaled_df and y_kmeans are assumed ready from the cells above

# --- Step 1: Apply 3D t-SNE ---
tsne_3d = TSNE(n_components=3, perplexity=30, learning_rate='auto', random_state=42, n_jobs=-1)
Z_tsne_3d = tsne_3d.fit_transform(Z_scaled_df)

# Convert to DataFrame
Z_tsne_df_3d = pd.DataFrame(data = Z_tsne_3d, columns = ['tSNE_1', 'tSNE_2', 'tSNE_3'])
Z_tsne_df_3d['KMeans_Cluster'] = y_kmeans

# Define the color for the clusters
cluster_colors = {
    0: 'lightskyblue',
    1: 'orange',
}

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

for cluster_label, color_name in cluster_colors.items():
    subset = Z_tsne_df_3d[Z_tsne_df_3d['KMeans_Cluster'] == cluster_label]

    ax.scatter(
        subset['tSNE_1'],
        subset['tSNE_2'],
        subset['tSNE_3'],
        s=10,
        c=color_name,  # <<< Use the specific color here
        label=f'Cluster {cluster_label}',
        alpha=0.7
    )

ax.set_title('K-Means Clusters Visualized by 3D t-SNE')
ax.set_xlabel('t-SNE Component 1')
ax.set_ylabel('t-SNE Component 2')
ax.set_zlabel('t-SNE Component 3')
ax.legend()
plt.show()
[Figure: K-Means clusters visualized by 3D t-SNE]
In [ ]:
# Apply 3D t-SNE
tsne_3d = TSNE(n_components=3, perplexity=30, learning_rate='auto', random_state=42, n_jobs=-1)
Z_tsne_3d = tsne_3d.fit_transform(Z_scaled_df)

# Convert to DataFrame
Z_tsne_df_3d = pd.DataFrame(data = Z_tsne_3d, columns = ['tSNE_1', 'tSNE_2', 'tSNE_3'])
Z_tsne_df_3d['KMeans_Cluster'] = y_kmeans

# Ensure the cluster column is treated as a string/categorical for distinct coloring
Z_tsne_df_3d['KMeans_Cluster_Str'] = Z_tsne_df_3d['KMeans_Cluster'].astype(str)

# Map each K-Means cluster to a fixed color (Plotly expects string keys here)
specific_plotly_colors = {
    '0': 'lightskyblue',
    '1': 'orange',
}

# Create the interactive 3D scatter plot using Plotly Express
fig = px.scatter_3d(
    Z_tsne_df_3d,
    x='tSNE_1',
    y='tSNE_2',
    z='tSNE_3',
    color='KMeans_Cluster_Str',         # Color points by cluster
    color_discrete_map=specific_plotly_colors, # <<< Assign specific colors
    title='Interactive 3D K-Means Clusters (Plotly)',
    opacity=0.7,                        # Transparency level
    labels={'tSNE_1': 't-SNE Component 1',
            'tSNE_2': 't-SNE Component 2',
            'tSNE_3': 't-SNE Component 3'}
)

# size_max only matters when a size column is mapped; set a fixed marker size instead
fig.update_traces(marker=dict(size=3))

fig.show() # This displays the rotatable plot in Google Colab
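Note that the rotatable Plotly view is lost when the notebook is exported to static HTML; fig.write_html saves a standalone interactive copy. A one-line sketch (the file name is illustrative):

In [ ]:
# Sketch: save an interactive standalone copy of the Plotly figure
fig.write_html('kmeans_tsne_3d.html')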

DBSCAN and t-SNE¶

In [ ]:
# 1. --- FIT DBSCAN MODEL ---
min_samples_start_k = int(np.log(Z_scaled_df.shape[0]))
eps_start = 2 # Start with a large value and tune down, or use a k-distance plot.

dbscan = DBSCAN(eps=eps_start, min_samples=min_samples_start_k)
secom_clusters = dbscan.fit_predict(Z_scaled_df)

print(f"DBSCAN Cluster Labels: {np.unique(secom_clusters)}")
print(f"Cluster Distribution (Tuning Required):\n{pd.Series(secom_clusters).value_counts()}")

# 2. --- PREPARE FOR VISUALIZATION (using t-SNE) ---
# DBSCAN was fit on the high-dimensional data, but we need 2D data for a plot.
# We will use t-SNE for dimensionality reduction for visualization.
tsne_2d = TSNE(n_components=2, perplexity=30, learning_rate='auto', random_state=42, n_jobs=-1)
Z_tsne_2d = tsne_2d.fit_transform(Z_scaled_df)

# Create a DataFrame for plotting
df_secom = pd.DataFrame(Z_tsne_2d, columns=['tSNE_1', 'tSNE_2'])
df_secom['Cluster'] = secom_clusters

# 3. --- PLOT THE DBSCAN CLUSTERS (using t-SNE components) ---
plt.figure(figsize=(10, 8))
plt.scatter(
    df_secom['tSNE_1'],
    df_secom['tSNE_2'],
    c=df_secom['Cluster'],
    cmap='viridis', # 'viridis' often works well for many colors
    s=10,          # Minimized dot size
    alpha=0.7
)
plt.title(f'DBSCAN Clusters on SECOM Data (Visualized with 2D t-SNE)')
plt.xlabel('t-SNE Component 1')
plt.ylabel('t-SNE Component 2')
plt.colorbar(label='Cluster Label (-1 is Noise)')
plt.show()
DBSCAN Cluster Labels: [-1  0  1]
Cluster Distribution (Tuning Required):
 0    1304
-1      80
 1       9
Name: count, dtype: int64
[Figure: DBSCAN clusters visualized by 2D t-SNE]
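As the comment in the cell above suggests, a k-distance plot is the standard way to choose eps: sort each point's distance to its k-th nearest neighbor and look for the knee. A minimal sketch with sklearn's NearestNeighbors, reusing min_samples_start_k as k:

In [ ]:
# Sketch: k-distance plot for choosing DBSCAN eps
from sklearn.neighbors import NearestNeighbors

k = min_samples_start_k  # same k as min_samples above
nn = NearestNeighbors(n_neighbors=k).fit(Z_scaled_df)
distances, _ = nn.kneighbors(Z_scaled_df)

# distances[:, -1] is the distance to the k-th neighbor (self is included at
# column 0, an off-by-one that does not move the knee)
k_dist = np.sort(distances[:, -1])

plt.figure(figsize=(7, 4))
plt.plot(k_dist)
plt.xlabel('Points sorted by k-distance')
plt.ylabel(f'Distance to {k}-th nearest neighbor')
plt.title('k-Distance Plot for DBSCAN eps Selection')
plt.grid(True)
plt.show()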
In [ ]:
# --- 1. FIT DBSCAN MODEL (same settings as the 2D cell above) ---
# Z_scaled_df is the high-dimensional, scaled latent feature data
min_samples_start = int(np.log(Z_scaled_df.shape[0]))
eps_start = 2

dbscan = DBSCAN(eps=eps_start, min_samples=min_samples_start)
secom_clusters = dbscan.fit_predict(Z_scaled_df)

print(f"DBSCAN Cluster Labels: {np.unique(secom_clusters)}")
print(f"Cluster Distribution (Tuning Required):\n{pd.Series(secom_clusters).value_counts()}")

# 2. --- PREPARE FOR VISUALIZATION (using 3D t-SNE) ---
# Change n_components from 2 to 3
tsne_3d = TSNE(n_components=3, perplexity=30, learning_rate='auto', random_state=42, n_jobs=-1)
Z_tsne_3d = tsne_3d.fit_transform(Z_scaled_df)

# Create a DataFrame for plotting, adding the third component
df_secom_3d = pd.DataFrame(Z_tsne_3d, columns=['tSNE_1', 'tSNE_2', 'tSNE_3'])
df_secom_3d['Cluster'] = secom_clusters

# 3. --- PLOT THE DBSCAN CLUSTERS (using 3D t-SNE components) ---

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d') # <<< KEY CHANGE: Adds 3D axis projection

# Plotting the clusters
scatter = ax.scatter(
    df_secom_3d['tSNE_1'],
    df_secom_3d['tSNE_2'],
    df_secom_3d['tSNE_3'], # <<< ADDING THE THIRD COMPONENT
    c=df_secom_3d['Cluster'],
    cmap='viridis',
    s=10,
    alpha=0.7
)

ax.set_title(f'DBSCAN Clusters on SECOM Data (Visualized with 3D t-SNE)')
ax.set_xlabel('t-SNE Component 1')
ax.set_ylabel('t-SNE Component 2')
ax.set_zlabel('t-SNE Component 3') # <<< ADDING Z-AXIS LABEL

# To display the color bar correctly for 3D Matplotlib
plt.colorbar(scatter, label='Cluster Label (-1 is Noise)')
plt.show()
DBSCAN Cluster Labels: [-1  0  1]
Cluster Distribution (Tuning Required):
 0    1304
-1      80
 1       9
Name: count, dtype: int64
[Figure: DBSCAN clusters visualized by 3D t-SNE]
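Since DBSCAN marks 80 points as noise, a direct follow-up for fail detection is whether those noise points are enriched for Fail wafers. A quick crosstab sketch, again assuming y_secom_target holds the -1/1 labels:

In [ ]:
# Sketch: do DBSCAN noise points (-1) coincide with Fail wafers?
print(pd.crosstab(pd.Series(secom_clusters, name='DBSCAN cluster'),
                  pd.Series(np.asarray(y_secom_target), name='Pass/Fail')))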